# Web Scraping

## Learning Outcomes:
- Learn the structure of HTML
- Learn how to use XPath to navigate HTML (via lxml)
- Use Selenium to scrape data from websites

One of the most common ways to obtain data is **web scraping**. As the name suggests, web scraping means pulling information from websites programmatically, because copying and pasting by hand would take far too much effort.

## The challenge

Let's say we want to build a model that predicts house prices from a set of features, for example location, number of bedrooms and number of bathrooms. We need some way of obtaining this data: both the features and the response (target) variable.

To introduce you to the concept of web scraping, let's try and extract data for 100 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address
    
[This URL shows houses listed for sale in London](https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list). Let's take a look at where the information that we want to extract is on the webpage.

Before we look at solving this challenge, let's take a look at what websites and HTML actually are.

## Websites

### What format does information on a website exist in?

We know that websites don't just print data in a nice CSV or JSON format. 
They have content, such as text, images and buttons, laid out on the page in a way that makes sense to a human reader. 
This content is defined in an HTML file.

They also have styling, usually defined in CSS, which controls how that content looks.

#### What is HTML?

HTML stands for HyperText Markup Language. It consists of a tree structure of different types of web elements, like buttons, page divisions, images and more. This means that it is used to define what **content** is rendered on any webpage that you visit.

HTML markup consists of elements (written as tags) that may contain other elements, which is what gives a page its tree structure.
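To make the tree structure concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to print every opening tag indented by its depth. The sample HTML string and the `TreePrinter` class are made up for illustration, and the sketch assumes every tag is explicitly closed (void tags such as `<br>` would need extra handling).

```python
from html.parser import HTMLParser

# Hypothetical sample page, not taken from any real website
SAMPLE_HTML = """
<html>
  <body>
    <div id="wrapper">
      <h1>Title</h1>
      <button>Click me</button>
    </div>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent the tag name by two spaces per level of nesting
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us one level back up the tree
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

The indentation of the printed tag names mirrors the nesting: `h1` and `button` sit at the same depth because both are children of the same `div`.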


[Let's play around with some HTML](https://code.sololearn.com/WoNr8gIeKYDr/)

### How can we get the website HTML, which contains data that we want?

When you enter a URL in a browser, here's what happens:
- your browser makes a **GET request** to the computer (server) that serves requests from that URL endpoint
- this computer knows what web content to send you back, so it sends it in a response to the request. This stuff includes the HTML of the page that you want to view.
- Your browser gets the HTML, and knows how to present that type of data to you (it renders the webpage)

The point here being that you can get the HTML, which defines the content for any site, by making a GET request to that website.

Let's try that!

We can use the requests library to get the HTML from a website

In [1]:
import requests # import the requests library
r = requests.get('http://pythonscraping.com/pages/page3.html') # make an HTTP GET request to this website
html_string = r.text # the text attribute of this response is the HTML as a string
print(html_string)

<html>
<head>
<style>
img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;">
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br>
123 Main St.<br>
Abuja, Nigeria
</br>We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr id="gift1" class="gift"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friend

# BeautifulSoup

What we saw above only gives us the HTML as one long string, and searching that string for specific data is a bit of a pain. We can use the **BeautifulSoup** library to parse the HTML and extract data by looking for specific tags and their attributes.

In [3]:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://pythonscraping.com/pages/page3.html')
html = page.text # Get the content of the webpage
soup = BeautifulSoup(html, 'html.parser') # Convert that into a BeautifulSoup object, whose methods make the tag search easier
print(soup.prettify())

<html>
 <head>
  <style>
   img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
  </style>
 </head>
 <body>
  <div id="wrapper">
   <img src="../img/gifts/logo.jpg" style="float:left;"/>
   <h1>
    Totally Normal Gifts
   </h1>
   <div id="content">
    Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
    <p>
     We haven't figured out how to make online shopping carts yet, but you can send us a check to:
     <br/>
     123 Main St.
     <br/>
     Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.
    </p>
   </div>
   <table id="giftList">
    <tr>
     <th>
      Item Title
     </th>
     <th>
      Description
     </th>
     <th>
      Cost
     </th>
     <th>
      Image
     </th>
   

Let's see an example using the following [URL](http://pythonscraping.com/pages/page3.html) `'http://pythonscraping.com/pages/page3.html'`

In that webpage you will find a small list with a set of items. Let's say that you want to extract the data from the Fish Painting.

<p align="center">
  <img src='images/BS4_1.png' width=500>
</p>


In your browser, right-click the Fish Painting row and select **Inspect** (or **Inspect Element**). The developer tools will show the HTML for the page: everything that is displayed lives inside the `<body>` tag, and the Fish Painting sits in a `<tr>` tag.

<p align="center">
  <img src='images/BS4_2.png' width=500>
</p>

You can find that tag using the `find` method, which accepts the tag name and the attributes of that tag.

In [None]:
fish = soup.find(name='tr', attrs={'id': 'gift3', 'class': 'gift'}) # If it doesn't find anything it returns None

print(fish)

Inside the `tr` tag, you will find different `<td>` tags. You can find all the `<td>` tags using the method `find_all` that accepts the tag name and the attributes of said tag.

In [None]:
fish_row = fish.find_all('td') # This returns a list where each item corresponds to each td tag 

Now you have a list where each element corresponds to one column of the row. Thus, you can index the list to get the data you want.

In [None]:
title = fish_row[0].text
description = fish_row[1].text
price = fish_row[2].text

print(title)
print(description)
print(price)
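Indexing by position works, but it is easy to mix up the columns. As a hedged alternative, the sketch below pairs each cell with its column header so the values can be looked up by name. The HTML is a shortened, made-up stand-in for the gift table (the description and cost are invented), so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the gift table (values are made up)
html = """
<table id="giftList">
  <tr><th>Item Title</th><th>Description</th><th>Cost</th></tr>
  <tr id="gift3" class="gift"><td>Fish Painting</td><td>A painting of a fish</td><td>$10.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"id": "giftList"})

# Column names come from the header cells
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Cell values come from the row we are interested in
row = table.find("tr", attrs={"id": "gift3"})
values = [td.get_text(strip=True) for td in row.find_all("td")]

# Pair every header with the value in the same position
record = dict(zip(headers, values))
print(record)
```

With a dictionary, `record["Cost"]` keeps working even if the site later reorders its columns.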

You can keep looking for more data in the tree. For example, you can look for the parrot row taking into account that it is the sibling of the fish row.

In [None]:
parrot = fish.find_next_sibling()

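And you can also find the tags nested inside the parrot row using the method `findChildren` (an older alias of `find_all`, so it returns every tag inside the row, not only the direct children):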

In [None]:
parrot_children = parrot.findChildren()

## Try it out

### What is a Method?


It is time to apply what you have learned so far about BeautifulSoup. Go to the following page and look for the `Methods` section: [https://en.wikipedia.org/wiki/Python_(programming_language)](https://en.wikipedia.org/wiki/Python_(programming_language))

You only need to extract the text from that section.

<p align="center">
  <img src='images/BS4_3.png' width=500>
</p>

_Tip_: The `p` tag containing the text does not have any attributes. Try looking at the `h3` tag before it, or the `a` child tag. From there, you can start moving around.

# Selenium


<a href=https://selenium-python.readthedocs.io/><p align=center><img src=images/selenium_logo.webp width=400></p></a>

Selenium is a tool for programmatically controlling a browser. It was originally intended for automated testing of web applications, but it can be used for anything that requires controlling a browser. Click the logo to go to the Selenium documentation!

## Webdriving

Selenium can "drive" a web browser. This means it can take full control of it: find elements, click, scroll, execute JavaScript and so on.

You need to specify which browser the webdriver will drive, such as Chrome or Firefox. To drive a browser you need to have its driver installed. We'll use the Chrome browser and download its driver, called ChromeDriver.

You should ensure you have the correct version of ChromeDriver, which should match the version of Chrome that you wish to drive. 

[Check your chrome version here](https://help.zenplanner.com/hc/en-us/articles/204253654-How-to-Find-Your-Internet-Browser-Version-Number-Google-Chrome)

[Download chromedriver from here](https://chromedriver.chromium.org/downloads)

If you are using Firefox, you need to download geckodriver. You can download it from [here](https://github.com/mozilla/geckodriver/releases)

If you are using Edge, you need to download the MicrosoftWebDriver. You can download it from [here](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)

If you are using Safari, open the `Develop` menu in your browser and enable "Allow Remote Automation".

Once you download the driver, you can move it to a directory on your system `PATH`. This will make things easier when you have to work in different directories, because the driver can then be found no matter where you run your code from. To do so, follow these steps:

1. Observe your `PATH` environment variable by running `echo $PATH`:

In [None]:
%%bash
echo $PATH

In my case, this is the PATH I get:
<p align=center><img src=images/Path.png></p>

This means that the operating system will look in each of these directories to find executables, including the webdriver that Selenium needs. You can tell the directories apart because they are separated by a colon (:). So, all the directories in this example are:
- `/Library/Frameworks/Python.framework/Versions/3.9/bin`
- `/opt/miniconda3/bin`
- `/opt/miniconda3/condabin`
- `/usr/local/bin`
- `/usr/bin`
- `/bin`
- `/usr/sbin`
- `/sbin`



2. Move your driver file to any of the directories in your `PATH` environment variable. For example, if `/usr/local/bin` is one of the directories in your `PATH`, you can move the driver there by running `mv chromedriver /usr/local/bin`. Make sure to replace `chromedriver` with the name of your driver.
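If you would rather check this from Python than from the shell, the sketch below lists the directories on your `PATH` and then looks for the driver in them. It only uses the standard library; the executable name `chromedriver` is an assumption, so swap in `geckodriver` or another name if you use a different browser.

```python
import os
import shutil

# PATH is a single string; os.pathsep is ':' on macOS/Linux and ';' on Windows
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
for directory in path_dirs:
    print(directory)

# shutil.which searches the PATH directories for an executable with this name
driver_location = shutil.which("chromedriver")
if driver_location is None:
    print("chromedriver was not found in any PATH directory")
else:
    print(f"chromedriver found at {driver_location}")
```

If the driver is reported as not found after you moved it, double-check that the destination directory really appears in the printed list.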


Finally, you need to install Selenium. Run this cell to install it, but make sure you are in the right environment! (In VSCode, if you are using a notebook, look at the top right, and if you are using a script, look at the bottom left)

In [None]:
%%bash
pip install selenium

And now you are ready to use selenium! 

IMPORTANT! If you are reading this in a Google Colab notebook, you can still use Selenium, but you need to enable the `headless` mode (we will see that later). This essentially means that the browser window will not open, so you will not see Selenium executing the commands. For the sake of practice, we encourage you to follow these steps in a notebook on your local machine. Once you have had more practice, you can start using Selenium in the cloud.

## Finding tree elements within a HTMLElement using XPath

Selenium finds the elements of a website by looking at its HTML code. You can navigate through this code by using XPaths. 

XPath is a query language for selecting nodes (elements) within a tree-like data structure such as HTML or XML.
Below is a very simple XPath expression. This one finds all of the button elements in the HTML:

`//button`

The `//` says "anywhere in the tree" and `button` says "find elements that have the tag type button". So this XPath expression says "find button tags anywhere within the tree".
The `xpath` method of an lxml `HtmlElement` takes in an XPath expression and returns a list of all elements in the tree that match it.
Below are more examples of how to use XPath:

- `./button` - finds button tags that are direct children of the current element (not deeper descendants)
- `//div/button` - finds all of the button tags inside div tags anywhere on the page
- `//div[@id='custom_id']` - finds all div tags with the attribute (@) id equal to custom_id, anywhere on the page
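To experiment with these expressions without opening a browser, here is a small sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml or Selenium. ElementTree only supports a subset of XPath and every expression must be relative to an element, so `//button` is written as `.//button`. The HTML snippet is made up for illustration and has to be well-formed XML for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet (ElementTree needs valid XML)
page = ET.fromstring("""
<html>
  <body>
    <button>Top-level button</button>
    <div id="custom_id">
      <button>Button inside a div</button>
    </div>
    <div id="other">
      <span><button>Nested deeper</button></span>
    </div>
  </body>
</html>
""")

body = page.find("body")

all_buttons = page.findall(".//button")               # buttons anywhere in the tree
direct_children = body.findall("./button")            # buttons directly under <body>
div_buttons = page.findall(".//div/button")           # buttons directly under any <div>
custom_div = page.findall(".//div[@id='custom_id']")  # divs with id="custom_id"

print(len(all_buttons))      # 3
print(len(direct_children))  # 1
print(len(div_buttons))      # 1
print(len(custom_div))       # 1
```

Note how `.//div/button` skips the button wrapped in a `<span>`: the single slash between `div` and `button` only matches direct children.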

If any of these don't make sense, let us know after looking it up.
Use the `//button` xpath expression as an argument to find the button on the page

Just as a taster, if you want to use Selenium to find the element corresponding to the button XPath, you can write it like this:
```
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Older tutorials use driver.find_element_by_xpath('//button'), which was removed in Selenium 4.3
driver.find_element(By.XPATH, '//button')
```

The first two lines import the `webdriver` module, which contains the different types of webdriver, and the `By` class, which lists the ways in which elements can be located

The third line assigns a Chrome webdriver to a variable `driver`. This is the instance that will help us navigate through the HTML code of the website

The last line finds the first element in the HTML code that matches that XPath. If no element matches, Selenium raises a `NoSuchElementException`

### Using the browser console



Modern browsers come with tools that maximise web developers' productivity and help them find bugs.

The developer console has a lot of different tools. 

Open your element inspector by pressing `CTRL + SHIFT + C`.
It should open on the right hand side of your screen as shown below.

The elements tab of the developer console shows you the HTML and CSS that make up the website code (actually it shows the DOM. Read more about what exactly the DOM is [here](https://css-tricks.com/dom/)).

You can always close the developer console by clicking the cross in the corner. 

Check out the zoopla website for yourself. Try using your selector to see the HTML structure of the page.
<p align=center><img src=images/form_selector.png width=600></p>

Now use your selector to find the location of the button as shown below.

<p align=center><img src=images/button_selector.png width=600></p>

As mentioned, the selector allows us to visualise the DOM and find elements within our webpage.


### Challenge: How many HTML buttons are there on the homepage? 

## Relative XPaths


> <font size=+1> We can find elements, and then search for elements within them! </font>

Elements returned by an XPath search have the same search methods as the driver itself. For example, if you have the following HTML code:

<p align=center><img src=images/HTML_example.png width=400></p>

The xpath of the highlighted element is `//div[@id="__next"]`. Once again, this xpath means:

- `//` indicates that it will look into the whole tree
- `div` indicates that it will look only for "div" tags
- `[]` whatever we write inside is going to correspond to the attributes of the tag we look for
- `[@id="__next"]` means that the tag we look for has an `id` attribute whose value is "__next"

Thus, the whole xpath means: In the whole tree, find a div tag whose id attribute is equal to "__next"

So, let's say that we find that element and assign it to a variable `my_path`
```
from selenium.webdriver.common.by import By

my_path = driver.find_element(By.XPATH, '//div[@id="__next"]')
```
If, after that, we wanted to find the inner "div" tag, we don't need to specify the whole xpath. We can refer to `my_path` and start the search from that point. This is also known as a "relative xpath"

To start the search from a certain point, we just need to use a dot (`.`), so, to find the next "div" tag, we can write this:
```
new_path = my_path.find_element(By.XPATH, './div')
```
And that's it!

Notice that in this case, we only used a single slash. That means that we are going to look for that element only in the direct children (but not in the grandchildren)
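The difference between `./div` and `.//div` is easy to try offline. The sketch below again uses the standard library's `xml.etree.ElementTree` as a stand-in for Selenium, with a made-up structure loosely modelled on the example above (the class names are hypothetical).

```python
import xml.etree.ElementTree as ET

# Hypothetical structure loosely modelled on the example above
root = ET.fromstring("""
<body>
  <div id="__next">
    <div class="inner">
      <div class="grandchild">deep</div>
    </div>
    <p>not a div</p>
  </div>
</body>
""")

# Step 1: find the outer element anywhere in the tree
outer = root.find(".//div[@id='__next']")

# Step 2: search again, starting from that element
children = outer.findall("./div")      # direct children only
descendants = outer.findall(".//div")  # children, grandchildren, ...

print([d.get("class") for d in children])     # ['inner']
print([d.get("class") for d in descendants])  # ['inner', 'grandchild']
```

Starting the second search from `outer` keeps the expression short and makes sure that divs elsewhere on the page cannot match by accident.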

## Beyond just GETTING static HTML


### Why might using requests to get the website content not work?

Some elements on webpages are inserted or manipulated by javascript code that runs only after the HTML is rendered.

Some information that you want may be shown only after interacting with certain elements.

The GET requests to the website just get the HTML file. They don't actually run the javascript code, or interact with the page after it renders. So parsing them for our data won't work.
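To see the problem in miniature, consider a made-up page whose price is filled in by a script. The sketch below parses the raw HTML with the standard library's `html.parser`, which, like `requests`, never executes JavaScript. The page and the `PriceFinder` class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the price is only written by JavaScript after loading
RAW_HTML = """
<html>
  <body>
    <span id="price"></span>
    <script>
      document.getElementById("price").textContent = "£500,000";
    </script>
  </body>
</html>
"""


class PriceFinder(HTMLParser):
    """Collect the text found inside <span id="price">."""

    def __init__(self):
        super().__init__()
        self.inside_price = False
        self.price_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == "price":
            self.inside_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside_price = False

    def handle_data(self, data):
        # Only keep text that appears between the opening and closing span
        if self.inside_price:
            self.price_text += data


finder = PriceFinder()
finder.feed(RAW_HTML)
print(repr(finder.price_text))  # '' because the script never ran
```

A browser would run the script and show the price, but the raw HTML that a GET request returns contains only the empty `<span>`.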

Again, there is a way around this. We can use Selenium to take control of a browser, which can then be programmatically instructed to fill in forms, click elements, and find data on any webpage.

## Using Selenium

To use selenium, we need to create an instance that is going to "drive" us through the webpage

In [1]:
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome()
driver.get("https://zoopla.co.uk")

Cool! We see that we've navigated to the Zoopla website. We can search for elements via XPath and can also send mouse and keyboard actions through Selenium. Let's recall the challenge we want to solve, extracting the following data for 100 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address

We'll focus our efforts on the London area. The next cell will take us to the URL corresponding to properties in London

In [2]:
driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)

Oh... Looks like cookies are blocking us... We need to find a way to get around this 🤔. Let's start by using xpath to find the "Accept All Cookies" button

UPDATE: The Zoopla website has a frame in the page. The 'Accept Cookies' button is inside this frame, so we have to tell Selenium to switch to the frame first. If a website doesn't use a frame, you can skip the `switch_to.frame` call

In [7]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
time.sleep(2) # Wait a couple of seconds, so the page can load and the website doesn't suspect you are a bot
try:
    driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame ("switch_to_frame" is the older, deprecated spelling)
    accept_cookies_button = driver.find_element(By.XPATH, '//*[@id="save"]')
    accept_cookies_button.click()
    driver.switch_to.default_content() # Leave the frame, so later searches look at the main page

except (NoSuchFrameException, NoSuchElementException):
    pass # If there is no cookies frame or button, there is nothing to accept, so we can pass


Wow, have you seen that? The webdriver went to the website and it clicked the button for us! So, let's analyze the methods we used:
- `find_element(By.XPATH, ...)` To make the driver point to the element 
- `click()` To make the driver click on the element that was pointed to

Alright, so it is time to start extracting the data we are interested in. Let's extract the price, address, number of bedrooms and the description

First of all, observe the HTML code corresponding to a property:
<p align=center><img src=images/Selenium_1.png width=900></p>

If you get the XPath of that property, it will look like this:

`//*[@id="listing_60212639"]`

Which is fine if we want to find a single property, but not so great if we want to list all the properties on that page. We will focus on how to get all the properties shortly. For now, let's extract the URL of that property, and then extract the information we need. 

__IMPORTANT! Zoopla is constantly adding new properties, so it is likely that the XPath has changed. Make sure that you follow all the steps and use the XPath from the current page__

Let's take another look at the HTML code. You will notice that there are some `<a>` tags in it. Usually, these tags are used to include a hyperlink reference (`href`). Selenium allows us to get that href, but first we need to locate the `<a>` tag containing it.

So, if you expand one of the `<div>` tags corresponding to a property, you will see something like this:

<p align=center><img src=images/Selenium_2.png width=900></p>

Can you see the `<a>` tag? That is the tag that contains the URL we need. So, let's tell selenium to extract it

In [12]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
time.sleep(2) # Wait a couple of seconds, so the page can load and the website doesn't suspect you are a bot
try:
    driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame
    accept_cookies_button = driver.find_element(By.XPATH, '//*[@id="save"]')
    accept_cookies_button.click()
    driver.switch_to.default_content() # Go back to the main page

except (NoSuchFrameException, NoSuchElementException):
    pass
time.sleep(2)
property = driver.find_element(By.XPATH, '//*[@id="listing_60212639"]') # Replace this XPath with one taken from a property on the current page
a_tag = property.find_element(By.TAG_NAME, 'a')
link = a_tag.get_attribute('href')
print(link)


https://www.zoopla.co.uk/new-homes/details/60212639/


Nice, now we can visit that link using selenium. Alternatively, you can also click on the `property` element (`property.click()`) and it will take you to the same page. But you will have to:
- Click the element
- Sleep
- Extract the information
- Go back
- Sleep
- Find the next property 
- Click
- Sleep

On the other hand, if you have the links, you can visit them like this:

- Extract all the links
- Iterate through the list, and for each iteration, visit the corresponding URL
- Sleep
- Extract the information of the property
- Visit the next URL

So, it's up to you, but for many websites, first building a list of links (a program that does this is usually called a "crawler") is much more efficient

Enough talking (or writing), let's visit the link we extracted

In [13]:
driver.get(link)

And it moved us to the webpage of that property

<p align=center><img src=images/Selenium_3.png width=900></p>

There, you can see the price, address, number of bedrooms, and the description. As always, let's take a look at the XPath corresponding to each piece of information

<p align=center><img src=images/Selenium_4.png width=900></p>

And there it is. If you do the same with the number of bedrooms, the address and the description, you should have something like the following.

In [14]:
from selenium.webdriver.common.by import By

price = driver.find_element(By.XPATH, '//span[@data-testid="price"]').text
print(price)
address = driver.find_element(By.XPATH, '//span[@data-testid="address-label"]').text
print(address)
bedrooms = driver.find_element(By.XPATH, '//span[@data-testid="beds-label"]').text
print(bedrooms)
div_tag = driver.find_element(By.XPATH, '//div[@data-testid="truncated_text_container"]')
span_tag = div_tag.find_element(By.XPATH, './/span') # Relative search: any span inside the container
description = span_tag.text
print(description)

£7,750,000
West Heath Avenue, Hampstead, London NW11
6 beds
A striking contemporary home with an abundance of light and space.

Location

The property is ideally situated for public transport and the national road network. Golders Green Underground Station (Northern Line) and Bus Terminus is just 400 metres walk, whilst there is easy road access to Brent Cross Shopping Centre, the A406 North Circular Road, the A41/A1 arterial route and junction 1 of the M1 Motorway.

Golders Hill Park is within 160 metres walk and offers beautiful plant displays, enhancing the peaceful setting of the Mediterranean and water gardens, while the park houses a popular café. There are also a variety of leisure facilities including tennis courts, croquet lawn, golf practice nets, butterfly house and a children's play area. In the park is also a free zoo, with a growing collection of rare and exotic birds and mammals such as laughing kookaburras, ring-tailed lemurs and ring-tailed coatis.



A striking home s

Now that we can extract each value, let's store them in a dictionary of lists, so that the data from more properties can be appended later

In [18]:
from selenium.webdriver.common.by import By

dict_properties = {'Price': [], 'Address': [], 'Bedrooms': [], 'Description': []}
price = driver.find_element(By.XPATH, '//span[@data-testid="price"]').text
dict_properties['Price'].append(price)
address = driver.find_element(By.XPATH, '//span[@data-testid="address-label"]').text
dict_properties['Address'].append(address)
bedrooms = driver.find_element(By.XPATH, '//span[@data-testid="beds-label"]').text
dict_properties['Bedrooms'].append(bedrooms)
div_tag = driver.find_element(By.XPATH, '//div[@data-testid="truncated_text_container"]')
span_tag = div_tag.find_element(By.XPATH, './/span')
description = span_tag.text
dict_properties['Description'].append(description) # Append, like the other fields, instead of overwriting the list

In [19]:
dict_properties

{'Price': ['£7,750,000'],
 'Address': ['West Heath Avenue, Hampstead, London NW11'],
 'Bedrooms': ['6 beds'],
 'Description': ["A striking contemporary home with an abundance of light and space.\n\nLocation\n\nThe property is ideally situated for public transport and the national road network. Golders Green Underground Station (Northern Line) and Bus Terminus is just 400 metres walk, whilst there is easy road access to Brent Cross Shopping Centre, the A406 North Circular Road, the A41/A1 arterial route and junction 1 of the M1 Motorway.\n\nGolders Hill Park is within 160 metres walk and offers beautiful plant displays, enhancing the peaceful setting of the Mediterranean and water gardens, while the park houses a popular café. There are also a variety of leisure facilities including tennis courts, croquet lawn, golf practice nets, butterfly house and a children's play area. In the park is also a free zoo, with a growing collection of rare and exotic birds and mammals such as laughing koo
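The scraped values are still strings such as `'£7,750,000'` and `'6 beds'`, while a price model needs numbers. A minimal sketch of one way to convert them is shown below; `parse_price` and `parse_bedrooms` are hypothetical helpers, and they assume whole-pound prices and a bedroom count written with digits, so listings such as "POA" would need extra handling.

```python
import re


def parse_price(price_text: str) -> int:
    """Turn a string such as '£7,750,000' into the integer 7750000."""
    # Drop everything that is not a digit (currency symbol, commas, spaces)
    digits = re.sub(r"[^\d]", "", price_text)
    return int(digits)


def parse_bedrooms(bedrooms_text: str) -> int:
    """Turn a string such as '6 beds' into the integer 6."""
    match = re.search(r"\d+", bedrooms_text)
    if match is None:
        raise ValueError(f"No number found in {bedrooms_text!r}")
    return int(match.group())


print(parse_price("£7,750,000"))  # 7750000
print(parse_bedrooms("6 beds"))   # 6
```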

## Adding links to a list: Creating a Crawler

As mentioned, it would be more efficient to create a list with all the links and then iterate through that list. Here, I am going to give a small teaser of what it looks like, but, ultimately, it will be your task to complete the whole scraper

Before we move on, I am going to create a function with the accept cookies functionality, so I don't have to repeat myself so many times

In [43]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
from selenium.webdriver.common.by import By
import time

def load_and_accept_cookies() -> webdriver.Chrome:
    '''
    Open Zoopla and accept the cookies
    
    Returns
    -------
    driver: webdriver.Chrome
        This driver is already in the Zoopla webpage
    '''
    driver = webdriver.Chrome() 
    URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
    driver.get(URL)
    time.sleep(3) 
    try:
        driver.switch_to.frame('gdpr-consent-notice') # This is the id of the frame
        accept_cookies_button = driver.find_element(By.XPATH, '//*[@id="save"]')
        accept_cookies_button.click()
        time.sleep(1)
        driver.switch_to.default_content() # Go back to the main page
    except (NoSuchFrameException, NoSuchElementException):
        pass # No cookies frame or button was found, so there is nothing to accept

    return driver 

Awesome, let's use this function from now on, it will make our code much more readable

In [25]:
driver = load_and_accept_cookies() # If it works, the driver should be on the Zoopla webpage with the cookies button clicked



Great, let's observe the list of properties one more time. All the properties are in a container as we can see in this image:

<p align=center><img src=images/Selenium_5.png width=900></p>

And, each property is one `<div>` tag inside that container. For example, for the two first properties:

<p align=center><img src=images/Selenium_7.png width=450><img src=images/Selenium_6.png width=450></p>

So, we can find a way to iterate through ALL the properties in that list, and for each iteration, extract the link. We saw earlier that if you have the property, you can easily find the `<a>` tag that contains the `href` like this:
```
property = driver.find_element(By.XPATH, '//*[@id="listing_60212639"]') # Replace this XPath with one taken from a property on the current page
a_tag = property.find_element(By.TAG_NAME, 'a')
link = a_tag.get_attribute('href')
```
Let's use something similar

In [32]:
from selenium.webdriver.common.by import By

driver = load_and_accept_cookies()
prop_container = driver.find_element(By.XPATH, '//*[@class="css-1anhqz4-ListingsContainer earci3d2"]') # XPath corresponding to the Container


Now, `prop_container` is pointing to the list of properties in the website. We have to get ALL the `<div>` tags inside, but only those that are direct children. So, we have to use a relative xpath: `./div`
- The dot represents that it is relative
- The single slash represents direct children

Also, take into account that we are looking for ALL occurrences of this XPath, so we have to use `find_elements` (plural) instead of `find_element`


In [41]:
from selenium.webdriver.common.by import By

prop_list = prop_container.find_elements(By.XPATH, './div')
link_list = []

for property in prop_list:
    a_tag = property.find_element(By.TAG_NAME, 'a')
    link = a_tag.get_attribute('href')
    link_list.append(link)
    
print(f'There are {len(link_list)} properties in this page')
print(link_list)

There are 25 properties in this page
['https://www.zoopla.co.uk/new-homes/details/60214186/', 'https://www.zoopla.co.uk/new-homes/details/60214076/', 'https://www.zoopla.co.uk/new-homes/details/60213997/', 'https://www.zoopla.co.uk/new-homes/details/60213955/', 'https://www.zoopla.co.uk/new-homes/details/60213931/', 'https://www.zoopla.co.uk/new-homes/details/60213891/', 'https://www.zoopla.co.uk/new-homes/details/60213807/', 'https://www.zoopla.co.uk/new-homes/details/60213785/', 'https://www.zoopla.co.uk/new-homes/details/60213797/', 'https://www.zoopla.co.uk/new-homes/details/60213732/', 'https://www.zoopla.co.uk/new-homes/details/60213753/', 'https://www.zoopla.co.uk/new-homes/details/60213127/', 'https://www.zoopla.co.uk/new-homes/details/60213083/', 'https://www.zoopla.co.uk/new-homes/details/60213079/', 'https://www.zoopla.co.uk/new-homes/details/60212733/', 'https://www.zoopla.co.uk/new-homes/details/59458212/', 'https://www.zoopla.co.uk/new-homes/details/60212693/', 'https://w

Now we have a list of the links of the properties in that page. How awesome is that?

Next, we need to iterate through this list and visit each link to extract the data we are interested in (Price, Address, Number of Bedrooms, Description)

## Try it out

This will finish the Zoopla practical that you have in the Portal. 

With your newly acquired knowledge, extract the data from all the properties in 5 different Zoopla pages. This means that, once you finish scraping a page, you have to click the 'Next Page' button (you can also change the URL if you know how to tweak it). So, once you extract the 25 links, you can go to the next page by clicking 'Next':

<p align=center><img src=images/Selenium_8.png width=450></p>

I included a template you can use to get started:

In [44]:
from selenium import webdriver
from selenium.webdriver.common.by import By

def get_links(driver: webdriver.Chrome) -> list:
    '''
    Returns a list with all the links in the current page

    Parameters
    ----------
    driver: webdriver.Chrome
        The driver that contains information about the current page
    
    Returns
    -------
    link_list: list
        A list with all the links in the page
    '''

    # The container that holds all the property cards
    prop_container = driver.find_element(By.XPATH, '//*[@class="css-1anhqz4-ListingsContainer earci3d2"]')
    # Each direct child div of the container is one property card
    prop_list = prop_container.find_elements(By.XPATH, './div')
    link_list = []

    for prop in prop_list:  # `prop` avoids shadowing the built-in `property`
        a_tag = prop.find_element(By.TAG_NAME, 'a')
        link = a_tag.get_attribute('href')
        link_list.append(link)

    return link_list

big_list = []
driver = load_and_accept_cookies()

for i in range(5): # The first 5 pages only
    big_list.extend(get_links(driver)) # Add the links from the current page to the big list
    ## TODO: Click the 'Next' button. Don't forget to sleep between pages, so the website doesn't flag you as a bot
    pass # Remove this pass once the code is complete


for link in big_list:
    ## TODO: Visit each link and extract the data. Don't forget to sleep between visits, so the website doesn't flag you as a bot
    pass # Remove this pass once the code is complete

driver.quit() # Close the browser when you finish



If you find a problem, use the 'Get Support' button in the portal!

## Extra: Navigating using Selenium

Selenium will allow us to do many other things, such as scroll, click, and send keystrokes. For example, you can run the following cells one by one and observe the results.

Let's see how to scroll down a page using Selenium.

In [7]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()

driver.get("http://www.python.org")

You are now on the official Python website. Let's scroll down to the bottom of the page.

In [None]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

The next cell will look for the search bar and click it.

In [None]:
from selenium.webdriver.common.by import By

search_bar = driver.find_element(By.XPATH, '//*[@id="id-search-field"]')
search_bar.click()

Now that you clicked it, you can send a keystroke to the search bar.

In [None]:
search_bar.send_keys("method")

Once you have entered the text, you can press Enter.

In [None]:
search_bar.send_keys(Keys.RETURN)

Whenever you need to perform an action in Selenium, think about the steps you would take as a human being. If you can explain it in words, Selenium can probably do it: just look at the documentation, or google it!

## Extra: Make Selenium wait for an element to appear

On many occasions, you will need to wait for an element to appear before you can scrape it. As mentioned above, many websites are dynamic, meaning that their whole content is not available right after connecting to them. In that case, Selenium will try to find elements before the whole page has loaded, so the scraper might fail if the element is not ready.

To solve this problem, you can tell Selenium to wait until the element you want to scrape appears. For example, in the Zoopla challenge, the frame containing the "Accept Cookies" button does not appear immediately. That is why we added a `time.sleep(3)` after telling the driver to go to that website:
```
    driver = webdriver.Chrome() 
    URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
    driver.get(URL)
    time.sleep(3) 
    try:
        driver.switch_to_frame('gdpr-consent-notice') # This is the id of the frame
        accept_cookies_button = driver.find_element_by_xpath('//*[@id="save"]')
        accept_cookies_button.click()
    ...
```

However, depending on the server and the user's connection, that number of seconds might vary. So, instead of using an arbitrary number like `3`, we might want to tell Selenium: "Wait until this frame shows up".

Selenium has many capabilities, and luckily, one of them allows us to implement this functionality. Let's take a look at how to do so:

First, let's import the libraries we need

In [10]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import time

Now, let's implement it on the code we had

In [13]:
def load_and_accept_cookies() -> webdriver.Chrome:
    '''
    Open Zoopla and accept the cookies
    
    Returns
    -------
    driver: webdriver.Chrome
        This driver is already in the Zoopla webpage
    '''
    driver = webdriver.Chrome() 
    URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
    driver.get(URL)
    delay = 10
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="gdpr-consent-notice"]')))
        print("Frame Ready!")
        driver.switch_to.frame('gdpr-consent-notice')
        accept_cookies_button = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="save"]')))
        print("Accept Cookies Button Ready!")
        accept_cookies_button.click()
        time.sleep(1)
    except TimeoutException:
        print("Loading took too much time!")

    return driver 

So, what's happening here?

1. As always, we define the driver and tell it to visit the URL
```
driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
```
2. We set a variable named `delay`, which is the maximum time (in seconds) we allow Selenium to wait.
```
delay = 10
```
3. Then, we use the `WebDriverWait` class to tell the driver to wait a maximum of 10 seconds. If, within those 10 seconds, the element matching the XPath `'//*[@id="gdpr-consent-notice"]'` (which corresponds to the frame) appears, Selenium stops waiting and keeps running the code.
```
WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="gdpr-consent-notice"]')))
```
4. If the element appears within 10 seconds, we just go on with the regular code to click the button. This code in turn has another `WebDriverWait`, in case the button takes a moment to appear after switching to the frame.
```
print("Frame Ready!")
driver.switch_to.frame('gdpr-consent-notice')
accept_cookies_button = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="save"]')))
print("Accept Cookies Button Ready!")
accept_cookies_button.click()
time.sleep(1)
```
5. However, if the element doesn't show up within 10 seconds, Selenium raises a `TimeoutException`, and the `except` clause is triggered.
```
except TimeoutException:
    print("Loading took too much time!")
```

Let's see how it works:

In [12]:
driver = load_and_accept_cookies()

Frame Ready!


Awesome! Thanks to this, we don't have to worry about setting an arbitrary number of seconds to wait.
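
To get a feel for what `WebDriverWait(driver, delay).until(...)` is doing, here is a simplified sketch of the same idea in plain Python: keep checking a condition every so often, return as soon as it holds, and give up after a maximum delay. The function `wait_until` is a hypothetical helper written for illustration, not Selenium's actual code, and the "element" here is simulated, so no browser is needed.

```python
import time


def wait_until(condition, timeout: float = 10, poll_frequency: float = 0.5):
    '''
    Call `condition` repeatedly until it returns a truthy value, then return it.
    Raise TimeoutError if `timeout` seconds pass first.
    '''
    end_time = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # Stop waiting as soon as the condition holds
        if time.monotonic() >= end_time:
            raise TimeoutError(f'Condition not met after {timeout} seconds')
        time.sleep(poll_frequency)  # Pause briefly before checking again


# Simulate an element that only "appears" on the third check
attempts = []

def element_appears():
    attempts.append(1)
    return 'element' if len(attempts) >= 3 else None

found = wait_until(element_appears, timeout=5, poll_frequency=0.1)
print(found, len(attempts))  # element 3
```

The key point is that the wait ends as soon as the condition is met, so `timeout` is only an upper bound, unlike `time.sleep(3)`, which always pauses for the full 3 seconds.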