___

Main: `Selenium`

Programming Language: `Python`

___

### Selenium

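Selenium is a tool for automating web browsers, and the `selenium` package provides its Python bindings. It can be used for a range of tasks, from testing web applications to collecting data from web pages.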

> Selenium Python bindings provides a simple API to write functional/acceptance tests using Selenium WebDriver. Through Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way.

In this notebook we are going to look at how we can use the `selenium` library for tasks such as data scraping.

### Installation

To use `selenium` we need to make sure that we have installed it. To install this package we are going to run the following code cell.


In [1]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.2.0-py3-none-any.whl (983 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.21.0-py3-none-any.whl (358 kB)
Collecting urllib3[secure,socks]~=1.26
  Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting sniffio
  Using cached sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.13.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, wsproto, sniffio, outcome, trio, trio-websocket, urllib3, selenium
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.25.11
    Uninstalling urllib3-1.25.11:
      Successfully uninstalled urllib3-1.25.11
Successfully installed h11-0.13.0 outcome-1.2.0 selenium-4.2.0 sniffio-1.2.0 trio-0.21.0 trio

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

conda 4.10.3 requires ruamel_yaml_conda>=0.11.14, which is not installed.


### Practical Example 1
In this practical example we are going to demonstrate how we can interact with Twitter, starting with logging in and then scraping data from a Twitter web page, using `selenium` and `python`.

### Imports

In the following code cell we are going to import the packages that we need in order to complete this practical.


In [2]:
import re
import csv
import os

from getpass import getpass
from time import sleep

from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium import webdriver
from selenium.webdriver.common.by import By

### Install and Access WebDriver
A webdriver is a vital ingredient in this process. It is what actually opens your browser automatically and loads the website of your choice. This step differs depending on which browser you use. I use Google Chrome. Some say Chrome works best with Selenium, although Selenium also supports Internet Explorer, Firefox, Safari, and Opera. For Chrome you first need to download the webdriver [here](https://chromedriver.chromium.org/downloads). There are several download options, depending on your version of Chrome. To find out which version of Chrome you have, click on the three vertical dots at the top right corner of your browser window, scroll down to `Help`, and select `About Google Chrome`. There you will see your version. I have version `102.0.5005.115`.

> Note that I will download the driver for version `102.0.5005.115` [here](https://chromedriver.chromium.org/downloads). I will download and `unzip` the `chromedriver.exe` file and move it to my `C` drive, so that the path to the driver will be:

```shell
"C:\chromedriver.exe"
```

**Note:** _You must know the location of your driver._


### Creating a `driver` object

In the following code cell we are going to create a `driver` object. We are going to use the `webdriver` from selenium and specify the path where our driver is located.


In [4]:
PATH_TO_DRIVER = "C:\\chromedriver.exe"

driver = webdriver.Chrome(PATH_TO_DRIVER)



> The approach above is now deprecated. In `selenium4` the `executable_path` argument is deprecated, so you have to pass an instance of the `Service()` class instead, here created from the path returned by `ChromeDriverManager().install()`.

But first we need to install `webdriver-manager`

In [6]:
!pip install webdriver-manager

Collecting webdriver-manager
  Downloading webdriver_manager-3.7.0-py2.py3-none-any.whl (25 kB)
Collecting python-dotenv
  Downloading python_dotenv-0.20.0-py3-none-any.whl (17 kB)
Installing collected packages: python-dotenv, webdriver-manager
Successfully installed python-dotenv-0.20.0 webdriver-manager-3.7.0


In [7]:
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

In [51]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))




[WDM] - Current google-chrome version is 102.0.5005
[WDM] - Get LATEST chromedriver version for 102.0.5005 google-chrome
[WDM] - Driver [C:\Users\crisp\.wdm\drivers\chromedriver\win32\102.0.5005.61\chromedriver.exe] found in cache


### Opening the Twitter website

To open the Twitter website we are going to use the `get` method of the `driver` object as follows:

> Note that we are going to open the login page so that we can provide our Twitter credentials and authenticate.

In [9]:
driver.get('https://www.twitter.com/login')
driver.maximize_window()

### Logging in to Twitter

In the following code cell we are going to log in to Twitter. We are going to use my Twitter username `@crispengari_`, and we are going to read the password using the `getpass` function that we imported earlier.

In [10]:
username = "@crispengari_"
password = getpass(prompt = 'Enter the password')

Enter the password········
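
If you would rather not hard-code the handle in the notebook, one option is to read it from an environment variable and fall back to a prompt. This is only a sketch: the variable name `TWITTER_USERNAME` and the helper `read_username` are made up for illustration and are not defined by Twitter or Selenium.

```python
import os


def read_username(env_var="TWITTER_USERNAME", prompt=input):
    """Return the handle from the environment, or ask for it interactively.

    `env_var` is a hypothetical variable name chosen for this example.
    """
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    # Nothing set in the environment, so ask the user instead.
    return prompt("Enter the username: ").strip()
```

The password can still be read with `getpass`, as in the cell above, so that it is never echoed or stored in the notebook.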


After getting the username and password we are going to access the textbox for the username, enter the username and submit it in the following code cells.

Here we select the `input` element anywhere in the document (the leading `//`) whose `name` attribute has the value `text`.

In [11]:
username_field = driver.find_element(By.XPATH, '//input[@name="text"]')

Adding value to the textbox.

In [12]:
username_field.send_keys(username)

Submitting the value

In [13]:
username_field.send_keys(Keys.RETURN)

Getting the password field.

In [15]:
password_field = driver.find_element(By.XPATH, '//input[@name="password"]')

Adding the value to the password field

In [16]:
password_field.send_keys(password)

Submitting the value of the password field.

In [17]:
password_field.send_keys(Keys.RETURN)

> If the login credentials provided are correct, we will be logged in to our Twitter account.
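
The cells above assume that each field is already on the page when `find_element` runs. When the notebook is run from top to bottom rather than cell by cell, the password field may not have rendered yet. Selenium ships `WebDriverWait` in `selenium.webdriver.support.ui` for this situation. The sketch below shows the same polling idea in plain Python so that it can be tried without a browser; `wait_until` and its parameters are illustrative names, not part of the Selenium API.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` until it returns a truthy value or `timeout` elapses.

    Exceptions raised by `condition` are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the DOM yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With the `driver` from this notebook, a call might look like `wait_until(lambda: driver.find_element(By.XPATH, '//input[@name="password"]'))`.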


### Searching on Twitter

We are going to search for tweets matching `elonmusk`. We need to access the search input on Twitter, type `elonmusk` into the search box and press the `RETURN` key. We are interested in scraping the latest tweets for `elonmusk`, so we will then need to click the `Latest` link.


In [28]:
search_input = driver.find_element(By.XPATH, '//input[@aria-label="Search query"]')

In [29]:
search_term = "elonmusk"
search_input.send_keys(search_term)

In [30]:
search_input.send_keys(Keys.RETURN)

Navigating to the latest tweets for `elonmusk`

In [31]:
driver.find_element(By.LINK_TEXT, 'Latest').click()

### Page Cards

On Twitter, pages are paginated based on scroll events, and the pages are not consistent. In the following code cell we are going to get all the tweet cards on the first page of the latest tweets for `elonmusk`. We are going to use a method called `find_elements`, which returns a list of elements.

In [34]:
page_cards = driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')

The first page card will look as follows.

In [36]:
card = page_cards[0]
card

<selenium.webdriver.remote.webelement.WebElement (session="4623aac710cd7a9d8342700bda76e761", element="05e489a6-9caa-433f-9186-0fa6a98edab8")>

> In the following code cells we are going to demonstrate how to get data relative (`.//`) to a single tweet `card`.

In [38]:
username = card.find_element(By.XPATH, './/span').text ## relative to that card
username

'Gary Vida'

In [39]:
handle = card.find_element(By.XPATH, './/span[contains(text(), "@")]').text
handle

'@GaryVida13'

In [40]:
postdate = card.find_element(By.XPATH, './/time').get_attribute('datetime')
postdate

'2022-06-17T06:00:53.000Z'
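
The `datetime` attribute comes back as a plain string. If you want to sort or filter tweets by time, it can be turned into a `datetime` object. The snippet below is a small sketch that assumes every timestamp has exactly the shape shown above (milliseconds followed by `Z`); `parse_postdate` is a name chosen for this example.

```python
from datetime import datetime, timezone


def parse_postdate(value):
    """Parse a timestamp such as '2022-06-17T06:00:53.000Z' into an aware datetime."""
    # The trailing 'Z' means UTC. It is matched here as a literal character,
    # so the timezone is attached explicitly afterwards.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


posted = parse_postdate("2022-06-17T06:00:53.000Z")
print(posted.isoformat())  # 2022-06-17T06:00:53+00:00
```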

> According to the documentation, the `By` class provides the following strategies for locating elements:

    
```
ID = "id"
NAME = "name"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
```
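
The XPath expressions used on `card` start with `.//`, which searches only below the current element, while a leading `//` searches the whole document. The idea can be tried without a browser using the standard library's `xml.etree.ElementTree`, which supports a small subset of XPath. The markup below is made up for illustration and is not real Twitter HTML.

```python
import xml.etree.ElementTree as ET

# Made-up markup that loosely mimics two tweet cards.
PAGE = """
<main>
  <article data-testid="tweet">
    <span>Alice</span>
    <span>@alice</span>
    <time datetime="2022-06-17T06:00:53.000Z" />
  </article>
  <article data-testid="tweet">
    <span>Bob</span>
    <span>@bob</span>
    <time datetime="2022-06-17T07:15:00.000Z" />
  </article>
</main>
"""

root = ET.fromstring(PAGE)

# Find every card below the root, similar in spirit to find_elements on the driver.
cards = root.findall('.//article[@data-testid="tweet"]')

rows = []
for card in cards:
    # './/' is evaluated relative to this card, so each lookup returns
    # the first match inside the card rather than the first on the page.
    name = card.find(".//span").text
    posted = card.find(".//time").get("datetime")
    rows.append((name, posted))

print(rows)
```

Each row holds the first `span` and the `time` value of its own card, which is the same behaviour the notebook relies on when it calls `card.find_element(By.XPATH, './/span')`.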

In [48]:
card.element

AttributeError: 'WebElement' object has no attribute 'element'

In [46]:
comment = card.find_element(By.TAG_NAME, 'div')
comment

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=102.0.5005.115)
Stacktrace:
Backtrace:
	Ordinal0 [0x00E9D953+2414931]
	Ordinal0 [0x00E2F5E1+1963489]
	Ordinal0 [0x00D1C6B8+837304]
	Ordinal0 [0x00D1F0B4+848052]
	Ordinal0 [0x00D1EF72+847730]
	Ordinal0 [0x00D1F200+848384]
	Ordinal0 [0x00D49215+1020437]
	Ordinal0 [0x00D4979B+1021851]
	Ordinal0 [0x00D3FCF1+982257]
	Ordinal0 [0x00D644E4+1131748]
	Ordinal0 [0x00D3FC74+982132]
	Ordinal0 [0x00D646B4+1132212]
	Ordinal0 [0x00D74812+1198098]
	Ordinal0 [0x00D642B6+1131190]
	Ordinal0 [0x00D3E860+976992]
	Ordinal0 [0x00D3F756+980822]
	GetHandleVerifier [0x0110CC62+2510274]
	GetHandleVerifier [0x010FF760+2455744]
	GetHandleVerifier [0x00F2EABA+551962]
	GetHandleVerifier [0x00F2D916+547446]
	Ordinal0 [0x00E35F3B+1990459]
	Ordinal0 [0x00E3A898+2009240]
	Ordinal0 [0x00E3A985+2009477]
	Ordinal0 [0x00E43AD1+2046673]
	BaseThreadInitThunk [0x7702FA29+25]
	RtlGetAppContainerNamedObjectPath [0x77D57A9E+286]
	RtlGetAppContainerNamedObjectPath [0x77D57A6E+238]
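
A `StaleElementReferenceException` like the one above usually means that the page re-rendered after `card` was found, so the stored reference no longer points at a live element. A common workaround is to locate the element again and retry. The helper below is a generic sketch with made-up names (`retry`, `attempts`, `delay`); in this notebook you would pass `StaleElementReferenceException` from `selenium.common.exceptions` as the exception type.

```python
import time


def retry(action, exceptions=(Exception,), attempts=3, delay=0.5):
    """Run `action()` and retry when it raises one of `exceptions`.

    `action` should re-locate the element on every call, so that each
    attempt works with a fresh reference.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay)
```

The important part is that the lookup happens inside `action`; retrying a call on the same stale `card` object would fail in the same way each time.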


In [None]:
responding = card.find_element(By.XPATH, './/div[2]/div[2]/div[2]').text
responding

In [None]:
def get_tweet_data(card):
    username = card.find_element(By.XPATH, './/span').text
    try:
        handle = card.find_element(By.XPATH, './/span[contains(text(), "@")]').text
    except NoSuchElementException:
        return
    
    try:
        postdate = card.find_element(By.XPATH, './/time').get_attribute('datetime')
    except NoSuchElementException:
        return
    
    comment = card.find_element(By.XPATH, './/div[2]/div[2]/div[1]').text
    responding = card.find_element(By.XPATH, './/div[2]/div[2]/div[2]').text
    text = comment + responding
    reply_cnt = card.find_element(By.XPATH, './/div[@data-testid="reply"]').text
    retweet_cnt = card.find_element(By.XPATH, './/div[@data-testid="retweet"]').text
    like_cnt = card.find_element(By.XPATH, './/div[@data-testid="like"]').text
    
    # Get a string of all emojis contained in the tweet.
    # Emojis are stored as images, so the file name, which holds the Unicode
    # code point in hexadecimal, is converted into the emoji character.
    emoji_tags = card.find_elements(By.XPATH, './/img[contains(@src, "emoji")]')
    emoji_list = []
    for tag in emoji_tags:
        filename = tag.get_attribute('src')
        try:
            emoji = chr(int(re.search(r'svg\/([a-z0-9]+)\.svg', filename).group(1), base=16))
        except AttributeError:
            continue
        if emoji:
            emoji_list.append(emoji)
    emojis = ' '.join(emoji_list)
    
    tweet = (username, handle, postdate, text, emojis, reply_cnt, retweet_cnt, like_cnt)
    return tweet    
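
The emoji step inside `get_tweet_data` can be tried on its own. The sketch below pulls the hexadecimal code point out of an image URL and converts it to a character. Unlike the function above, it also accepts file names made of several code points joined by `-`, which is an assumption about how such images might be named. The URLs and the name `emoji_from_src` are examples only.

```python
import re

# Matches the hexadecimal code points in a file name such as '.../svg/1f600.svg'.
EMOJI_FILE = re.compile(r"svg/([0-9a-f]+(?:-[0-9a-f]+)*)\.svg")


def emoji_from_src(src):
    """Return the emoji encoded in an image URL, or '' if none is found."""
    match = EMOJI_FILE.search(src)
    if match is None:
        return ""
    # Some emojis (flags, for example) are several code points joined by '-'.
    return "".join(chr(int(part, 16)) for part in match.group(1).split("-"))


# Example URL, not taken from a real page.
grinning = emoji_from_src("https://example.com/emoji/v2/svg/1f600.svg")
```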

In [None]:
# get all tweets on the page
data = []
tweet_ids = set()
last_position = driver.execute_script("return window.pageYOffset;")
scrolling = True

while scrolling:
    page_cards = driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')
    for card in page_cards[-15:]:
        tweet = get_tweet_data(card)
        if tweet:
            tweet_id = ''.join(tweet)
            if tweet_id not in tweet_ids:
                tweet_ids.add(tweet_id)
                data.append(tweet)
            
    scroll_attempt = 0
    while True:
        # check scroll position
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(2)
        curr_position = driver.execute_script("return window.pageYOffset;")
        if last_position == curr_position:
            scroll_attempt += 1
            
            # end of scroll region
            if scroll_attempt >= 3:
                scrolling = False
                break
            else:
                sleep(2) # attempt another scroll
        else:
            last_position = curr_position
            break
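
The inner scrolling loop above can also be written as a small function, which makes the "stop after three scrolls with no movement" rule easier to read and to test. This is a sketch rather than a drop-in replacement: `scroll_once`, `max_attempts` and `pause` are names chosen for this example, and `driver` only needs an `execute_script` method.

```python
from time import sleep


def scroll_once(driver, last_position, max_attempts=3, pause=2.0):
    """Scroll to the bottom and report whether the page moved.

    Returns a tuple (new_position, reached_end).
    """
    for _ in range(max_attempts):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to load more tweets
        position = driver.execute_script("return window.pageYOffset;")
        if position != last_position:
            return position, False
    # The position never changed, so assume the end of the feed was reached.
    return last_position, True
```

Inside the outer `while scrolling:` loop, the inner `while True:` block could then become `last_position, reached_end = scroll_once(driver, last_position)` followed by `scrolling = not reached_end`.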

### Saving the data

We are going to save the data in a file called `elonmusk.csv`.

In [None]:
with open('elonmusk.csv', 'w', newline='', encoding='utf-8') as f:
    # Column order matches the tuple returned by get_tweet_data.
    header = ['userName', 'handle', 'timestamp', 'text', 'emojis', 'comments', 'retweets', 'likes']
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(data)
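
To check that a file was written as expected, it can be read back with `csv.DictReader`. The sketch below writes one made-up row to a temporary file and loads it again; the sample values are invented, and the column names follow the order of the tuple returned by `get_tweet_data`.

```python
import csv
import os
import tempfile

header = ["userName", "handle", "timestamp", "text", "emojis",
          "comments", "retweets", "likes"]
# One invented row, in the same order as the header.
sample = [("Example User", "@example", "2022-06-17T06:00:53.000Z",
           "hello", "", "1", "2", "3")]

path = os.path.join(tempfile.mkdtemp(), "sample_tweets.csv")

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(sample)

# Read the file back; each row becomes a dict keyed by the header names.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["handle"], rows[0]["likes"])
```

All values come back as strings, so counts such as `likes` need converting with `int` before doing arithmetic on them.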

### Closing the driver
After you are done, you can close the current browser window with `close()`, or end the whole session and close every window with `quit()`:

In [52]:
driver.close()

**OR**

In [49]:
driver.quit()

### References

1. [selenium-python.readthedocs.io](https://selenium-python.readthedocs.io/)