### This notebook demonstrates how to crawl a JavaScript-rendered website and covers advanced topics in web crawling.

These are **advanced topics** in web crawling.

#### Topics covered in this tutorial:

- Crawling JavaScript-rendered websites
- Crawling websites that require a login
- Crawling websites with input forms
- Crawling websites that use infinite scrolling
- And more ...

### JavaScript-rendered website

Go to **http://quotes.toscrape.com/js/** (a JavaScript-rendered website)

In [1]:
import requests
from lxml import html
import pandas as pd
import csv

In [2]:
#storing response
response = requests.get('http://quotes.toscrape.com/js/')
data = html.fromstring(response.text)

print(data.xpath("//span/text()"))

['→', '❤']



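The XPath above looks correct, but it does not return the data we expect. This is because the page is rendered by JavaScript: the quotes are added to the page by a script that runs in the browser, so they are not present in the raw HTML that requests downloads.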
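To make this concrete, the sketch below works on a small hypothetical HTML snippet (`SAMPLE_HTML`, not the real page source). It assumes the quotes are stored as a JavaScript array inside a `<script>` tag, which is a common pattern for such pages but is only an assumption here. Before a browser runs the script, there is no quote element for an XPath to match, although the data itself can sometimes be read straight from the script.

```python
import json
import re

# Hypothetical raw HTML, as an HTTP client would receive it before any
# JavaScript has run. The quote exists only as data inside <script>.
SAMPLE_HTML = """
<html><body>
<div id="quotes"></div>
<script>
var data = [{"author": "Steve Martin", "text": "A day without sunshine is like, you know, night."}];
// a browser would loop over data here and build the quote elements
</script>
</body></html>
"""

def has_rendered_quotes(raw_html):
    """Return True if the raw HTML already contains rendered quote spans."""
    return re.search(r'<span[^>]*class="text"', raw_html) is not None

def embedded_quotes(raw_html):
    """Extract the JSON array assigned to `var data`, if there is one."""
    match = re.search(r"var data = (\[.*?\]);", raw_html, re.DOTALL)
    return json.loads(match.group(1)) if match else []

print(has_rendered_quotes(SAMPLE_HTML))           # False
print(embedded_quotes(SAMPLE_HTML)[0]["author"])  # Steve Martin
```

When the data is not embedded in such an easily parsed form, a real browser has to execute the script, which is what the rest of this notebook does.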


Crawling JavaScript-rendered pages requires a more advanced approach: **Python Selenium**

### Selenium

Install Python Selenium: **pip install selenium**

Selenium requires a **driver** to interface with the chosen browser. Firefox, for example, requires **geckodriver**, which needs to be installed before the examples below can be run.

Go to https://github.com/mozilla/geckodriver/releases (**Firefox** is used in this tutorial), download **geckodriver**, and unzip the file. After unzipping, place **the exe file** in **/Anaconda/Library/bin** and **make sure it’s in your PATH (environment variables).**

<img src="images\geckodriver.png">
<img src="images\path.png">

Skipping this step will raise the error selenium.common.exceptions.WebDriverException: Message: ‘geckodriver’ executable needs to be in PATH.
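Before launching a browser, it can be useful to check whether the driver is actually visible on PATH. The snippet below is a minimal sketch using only the standard library; `find_driver` is a hypothetical helper name, not part of Selenium. Note also that recent Selenium releases (4.6 and later) ship Selenium Manager, which can usually locate or download a matching driver automatically, so the manual step may not be needed there.

```python
import shutil

def find_driver(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)

for driver_name in ("geckodriver", "chromedriver"):
    location = find_driver(driver_name)
    print(driver_name, "->", location or "not found on PATH")
```

If a driver is reported as not found, revisit the PATH settings described above before running the Selenium examples.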

Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers follow.

- Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
- Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Source: Python selenium webpage

### Example: Crawling a JavaScript site using Selenium (Chrome)

- Find where Anaconda is installed on your computer and place the downloaded 'chromedriver' in \\Anaconda3\Library\bin

- Add the location of the downloaded 'chromedriver' to Path: Advanced System Settings > Environment Variables > Path > Edit > New, paste the copied file address, and click 'OK'

In [3]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/js/")

title = driver.find_elements_by_xpath("//div[@class='quote']/span[@class='text']")

for i in title:
    print(i.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


### Example: Crawling a JavaScript site using Selenium (Firefox)

In [4]:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/js/")

title = driver.find_elements_by_xpath("//div[@class='quote']/span[@class='text']")

for i in title:
    print(i.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


### Locating Elements
http://selenium-python.readthedocs.io/locating-elements.html

There are various strategies to locate elements in a page. You can use the most appropriate one for your case. Selenium provides the following methods to locate elements in a page:

    find_element_by_id
    find_element_by_name
    find_element_by_xpath
    find_element_by_link_text
    find_element_by_partial_link_text
    find_element_by_tag_name
    find_element_by_class_name
    find_element_by_css_selector

To find multiple elements (these methods will return a **list**):

    find_elements_by_name
    find_elements_by_xpath
    find_elements_by_link_text
    find_elements_by_partial_link_text
    find_elements_by_tag_name
    find_elements_by_class_name
    find_elements_by_css_selector
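The `find_element_by_*` and `find_elements_by_*` helpers listed above were deprecated in Selenium 4 and later removed (in 4.3, to the best of our knowledge). Newer versions use `driver.find_element(by, value)` and `driver.find_elements(by, value)` with a locator strategy instead. The sketch below shows how the old names map to the new call; `LEGACY_TO_STRATEGY`, `find_one`, and `find_all` are hypothetical helpers written for this illustration, and the strategy strings are the values of the `By` constants in `selenium.webdriver.common.by`.

```python
# Locator strategy strings; in Selenium 4 these are the values of
# By.ID, By.NAME, By.XPATH, By.LINK_TEXT, By.PARTIAL_LINK_TEXT,
# By.TAG_NAME, By.CLASS_NAME and By.CSS_SELECTOR respectively.
LEGACY_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}

def find_one(driver, legacy_suffix, value):
    """Equivalent of the old driver.find_element_by_<legacy_suffix>(value)."""
    return driver.find_element(LEGACY_TO_STRATEGY[legacy_suffix], value)

def find_all(driver, legacy_suffix, value):
    """Equivalent of the old driver.find_elements_by_<legacy_suffix>(value)."""
    return driver.find_elements(LEGACY_TO_STRATEGY[legacy_suffix], value)
```

With a real driver on a recent Selenium version, the usual spelling is `from selenium.webdriver.common.by import By` followed by, for example, `driver.find_elements(By.XPATH, "//div[@class='quote']")`. The cells in this notebook use the older method names and may need this change to run on current releases.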

### Examples: Locating Elements
http://selenium-python.readthedocs.io/locating-elements.html

#### Locating Elements by Class Name

Use this when you want to locate an element by class attribute name. With this strategy, the first element with the matching class attribute name will be returned. If no element has a matching class attribute name, a NoSuchElementException will be raised.

For instance, consider this page source:

    <html>
     <body>
      <p class="content">Site content goes here.</p>
     </body>
    </html>

The “p” element can be located like this:

    content = driver.find_element_by_class_name('content')
    
    
#### Locating by XPath

For instance, consider this page source:

    <html>
     <body>
      <form id="loginForm">
       <input name="username" type="text" />
       <input name="password" type="password" />
       <input name="continue" type="submit" value="Login" />
       <input name="continue" type="button" value="Clear" />
      </form>
     </body>
    </html>
    
The form elements can be located like this:

    login_form = driver.find_element_by_xpath("/html/body/form[1]")
    login_form = driver.find_element_by_xpath("//form[1]")
    login_form = driver.find_element_by_xpath("//form[@id='loginForm']")
    
1. Absolute path (would break if the HTML was changed only slightly)
2. First form element in the HTML
3. The form element with attribute named id and the value loginForm

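The username element can be located like this:

    username = driver.find_element_by_xpath("//form[input/@name='username']")
    username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
    username = driver.find_element_by_xpath("//input[@name='username']")

1. First form element with an input child whose name attribute is username (note that this expression selects the form itself, not the input)
2. First input child of the form element whose id attribute is loginForm
3. First input element whose name attribute is username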
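These expressions can be tried without a browser. The sketch below parses the login form from the example above with the standard library's `xml.etree.ElementTree`, which works here only because the snippet happens to be well-formed XML. ElementTree supports just a subset of XPath, and its paths start with `.//` rather than `//`, so this illustrates the selection logic rather than Selenium's own behaviour.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# The form, selected by its id attribute
form = root.find(".//form[@id='loginForm']")
# The first input child of that form
first_input = root.find(".//form[@id='loginForm']/input[1]")
# The input whose name attribute is "username"
username = root.find(".//input[@name='username']")
# Both inputs named "continue" (findall returns a list)
buttons = root.findall(".//input[@name='continue']")

print(form.tag, form.get("id"))
print(first_input.get("name"), username.get("type"))
print([button.get("value") for button in buttons])
```

The last line shows why a name alone can be ambiguous: two inputs share the name `continue`, so a single-element lookup would return only the first one.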

### Example: Login page

In [3]:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/login")

time.sleep(5)

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("abc")
password.send_keys("abc")

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
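The fixed `time.sleep(5)` above either wastes time or is too short on a slow connection. A common alternative is to poll until the page is ready. The sketch below shows that idea in plain Python; `wait_until` and `page_ready` are hypothetical names for this illustration, and the stand-in condition simply becomes true on the third check.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Stand-in for "is the element there yet?": becomes true on the third check.
attempts = {"count": 0}

def page_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3

print(wait_until(page_ready, timeout=2.0, poll=0.01))  # True
print(attempts["count"])                               # 3
```

Selenium provides the same pattern ready-made as `WebDriverWait` (in `selenium.webdriver.support.ui`) combined with `expected_conditions`, for example `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "username")))`.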

### Example: Login and Collect Data

In [15]:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/login")

time.sleep(5)

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("abc")
password.send_keys("abc")

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()

#just wait a little for the browser to be ready
time.sleep(10)

for review in driver.find_elements_by_xpath("//div[@class='quote']"):
    name = review.find_element_by_xpath("span[2]/small[@class='author']").text
    tags = review.find_element_by_xpath("div[@class='tags']").text
    url = review.find_element_by_xpath("span[2]/a[2]").get_attribute('href')
    print(name, url)

Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
J.K. Rowling http://goodreads.com/author/show/1077326.J_K_Rowling
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
Jane Austen http://goodreads.com/author/show/1265.Jane_Austen
Marilyn Monroe http://goodreads.com/author/show/82952.Marilyn_Monroe
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
André Gide http://goodreads.com/author/show/7617.Andr_Gide
Thomas A. Edison http://goodreads.com/author/show/3091287.Thomas_A_Edison
Eleanor Roosevelt http://goodreads.com/author/show/44566.Eleanor_Roosevelt
Steve Martin http://goodreads.com/author/show/7103.Steve_Martin


In [17]:
# save data
import pandas as pd
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/login")

time.sleep(5)

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("abc")
password.send_keys("abc")

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()

#just wait a little for the browser to be ready
time.sleep(10)

data = []
for review in driver.find_elements_by_xpath("//div[@class='quote']"):
    name = review.find_element_by_xpath("span[2]/small[@class='author']").text
    url = review.find_element_by_xpath("span[2]/a[2]").get_attribute('href')
    print(name, url)
    data.append([name, url])

df = pd.DataFrame(data, columns=["name", "url"])   # name the columns so the CSV gets a readable header
df.to_csv("quotes.csv", index=False, encoding='utf-8')

Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
J.K. Rowling http://goodreads.com/author/show/1077326.J_K_Rowling
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
Jane Austen http://goodreads.com/author/show/1265.Jane_Austen
Marilyn Monroe http://goodreads.com/author/show/82952.Marilyn_Monroe
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
André Gide http://goodreads.com/author/show/7617.Andr_Gide
Thomas A. Edison http://goodreads.com/author/show/3091287.Thomas_A_Edison
Eleanor Roosevelt http://goodreads.com/author/show/44566.Eleanor_Roosevelt
Steve Martin http://goodreads.com/author/show/7103.Steve_Martin


### Example: Form

In [21]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/search.aspx")

time.sleep(5)

#driver.find_element_by_xpath("//select[@name='author']/option[text()='Steve Martin']").click()
#driver.find_element_by_xpath("//select[@name='tag']/option[text()='humor']").click()

select_author = Select(driver.find_element_by_name('author'))
select_author.select_by_visible_text('Steve Martin')

select_tag = Select(driver.find_element_by_name('tag'))
select_tag.select_by_visible_text('humor')

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()

time.sleep(5)

author = driver.find_element_by_xpath("//div[@class='quote']/span[@class='author']").text
quote = driver.find_element_by_xpath("//div[@class='quote']/span[@class='content']").text
                                     
print(author, quote)

Steve Martin “A day without sunshine is like, you know, night.”


### Example: Pagination

In [5]:
# This loop only stops when no "next" link is found, so it crawls every page.
# Consider a for loop to iterate over a set number of pages.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/")

while True:
    for review in driver.find_elements_by_xpath("//div[@class='quote']"):
        name = review.find_element_by_xpath("span[2]/small[@class='author']").text
        url = review.find_element_by_xpath("span[2]/a[1]").get_attribute('href')
        print(name, url)
 
    try:
        next_link = driver.find_element_by_xpath("//li[@class='next']/a")
    except NoSuchElementException:
        # no "next" link on the last page: stop crawling
        break
    next_link.click()
    time.sleep(5)

Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
J.K. Rowling http://quotes.toscrape.com/author/J-K-Rowling
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
Jane Austen http://quotes.toscrape.com/author/Jane-Austen
Marilyn Monroe http://quotes.toscrape.com/author/Marilyn-Monroe
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
André Gide http://quotes.toscrape.com/author/Andre-Gide
Thomas A. Edison http://quotes.toscrape.com/author/Thomas-A-Edison
Eleanor Roosevelt http://quotes.toscrape.com/author/Eleanor-Roosevelt
Steve Martin http://quotes.toscrape.com/author/Steve-Martin


In [6]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/")

data = []
for i in range(1,4):   # first three pages only
    for review in driver.find_elements_by_xpath("//div[@class='quote']"):
        # .text and get_attribute() already return str, so no .encode() is needed
        name = review.find_element_by_xpath("span[2]/small[@class='author']").text
        url = review.find_element_by_xpath("span[2]/a[1]").get_attribute('href')
        print(name, url)
        data.append([name, url])
 
    try:
        next_link = driver.find_element_by_xpath("//li[@class='next']/a")
        next_link.click()
        time.sleep(5)
    except NoSuchElementException:
        break


df = pd.DataFrame(data, columns=["name", "url"])
df.to_csv("quotes_pagination.csv", index=False, encoding='utf-8')

Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
J.K. Rowling http://quotes.toscrape.com/author/J-K-Rowling
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
Jane Austen http://quotes.toscrape.com/author/Jane-Austen
Marilyn Monroe http://quotes.toscrape.com/author/Marilyn-Monroe
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
André Gide http://quotes.toscrape.com/author/Andre-Gide
Thomas A. Edison http://quotes.toscrape.com/author/Thomas-A-Edison
Eleanor Roosevelt http://quotes.toscrape.com/author/Eleanor-Roosevelt
Steve Martin http://quotes.toscrape.com/author/Steve-Martin
Marilyn Monroe http://quotes.toscrape.com/author/Marilyn-Monroe
J.K. Rowling http://quotes.toscrape.com/author/J-K-Rowling
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
Bob Marley http://quotes.toscrape.com/author/Bob-Marley
Dr. Seuss http://quotes.tos

### Example: Infinite Scrolling

In [27]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome()
driver.get("http://spidyquotes.herokuapp.com/scroll")
#driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    for row in driver.find_elements_by_xpath("//div[@class='quote']"):
        author = row.find_element_by_xpath("span[2]/small[@class='author']").text
        quote = row.find_element_by_xpath("span[@class='text']").text
        print(author, quote)
    
    try:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    except:
        break


Albert Einstein “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
J.K. Rowling “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Albert Einstein “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Jane Austen “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Marilyn Monroe “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Albert Einstein “Try not to become a man of success. Rather become a man of value.”
André Gide “It is better to be hated for what you are than to be loved for what you are not.”
Thomas A. Edison “I have not failed. I've just found 10,000 ways that won't work.”
Eleanor Roosevelt “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Steve Martin

Mother Teresa “If you judge people, you have no time to love them.”
Garrison Keillor “Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”
Jim Henson “Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”
Dr. Seuss “Today you are You, that is truer than true. There is no one alive who is Youer than You.”
Albert Einstein “If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
J.K. Rowling “It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”
Albert Einstein “Logic will get you from A to Z; imagination will get you everywhere.”
Bob Marley “One good thing about music, when it hits you, you feel no pain.”
Albert Einstein “The world as we have crea

Elie Wiesel “The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”
Friedrich Nietzsche “It is not a lack of love, but a lack of friendship that makes unhappy marriages.”
Mark Twain “Good friends, good books, and a sleepy conscience: this is the ideal life.”
Allen Saunders “Life is what happens to us while we are making other plans.”
Pablo Neruda “I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”
Ralph Waldo Emerson “For every minute you are angry you lose sixty seconds of happiness.”
Mother Teresa “If you judge people, you have no time to love them.”
Garr

Dr. Seuss “Today you are You, that is truer than true. There is no one alive who is Youer than You.”
Albert Einstein “If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
J.K. Rowling “It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”
Albert Einstein “Logic will get you from A to Z; imagination will get you everywhere.”
Bob Marley “One good thing about music, when it hits you, you feel no pain.”
Dr. Seuss “The more that you read, the more things you will know. The more that you learn, the more places you'll go.”
J.K. Rowling “Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”
Bob Marley “The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”
Mother Teresa “Not all of us can do great things

Dr. Seuss “The more that you read, the more things you will know. The more that you learn, the more places you'll go.”
J.K. Rowling “Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”
Bob Marley “The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”
Mother Teresa “Not all of us can do great things. But we can do small things with great love.”
J.K. Rowling “To the well-organized mind, death is but the next great adventure.”
Charles M. Schulz “All you need is love. But a little chocolate now and then doesn't hurt.”
William Nicholson “We read to know we're not alone.”
Albert Einstein “Any fool can know. The point is to understand.”
Jorge Luis Borges “I have always imagined that Paradise will be a kind of library.”
George Eliot “It is never too late to be what you might have been.”
George R.R. Martin “A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives on

Jim Henson “Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”
Dr. Seuss “Today you are You, that is truer than true. There is no one alive who is Youer than You.”
Albert Einstein “If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
J.K. Rowling “It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”
Albert Einstein “Logic will get you from A to Z; imagination will get you everywhere.”
Bob Marley “One good thing about music, when it hits you, you feel no pain.”
Dr. Seuss “The more that you read, the more things you will know. The more that you learn, the more places you'll go.”
J.K. Rowling “Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”
Bob Marley 

Elie Wiesel “The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”
Friedrich Nietzsche “It is not a lack of love, but a lack of friendship that makes unhappy marriages.”
Mark Twain “Good friends, good books, and a sleepy conscience: this is the ideal life.”
Allen Saunders “Life is what happens to us while we are making other plans.”
Pablo Neruda “I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”
Ralph Waldo Emerson “For every minute you are angry you lose sixty seconds of happiness.”
Mother Teresa “If you judge people, you have no time to love them.”
Garr

Jane Austen “There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”
C.S. Lewis “Some day you will be old enough to start reading fairy tales again.”
C.S. Lewis “We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”
Mark Twain “The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”
Mark Twain “A lie can travel half way around the world while the truth is putting on its shoes.”
C.S. Lewis “I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”
Albert Einstein “The world as we have created it is a process of our thinking. It cannot be changed wi

KeyboardInterrupt: 
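The output above repeats the same quotes because the extraction loop runs again over the whole page after every scroll, and the run shown was stopped by hand (KeyboardInterrupt). One possible alternative, sketched below, is to scroll until the page height stops growing and only then extract the quotes once. `scroll_to_end` and its `max_rounds` safety limit are hypothetical additions for this illustration, and `FakePage` is a stand-in for a real WebDriver so the logic can be run without a browser.

```python
import time

def scroll_to_end(driver, pause=0.5, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded: we reached the end
        last_height = new_height
    return max_rounds  # safety limit reached

class FakePage:
    """Hypothetical stand-in for a WebDriver: grows three times, then stops."""
    def __init__(self):
        self.heights = [1000, 2000, 3000, 4000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            self.index = min(self.index + 1, len(self.heights) - 1)
            return None
        return self.heights[self.index]

print(scroll_to_end(FakePage(), pause=0))  # 4
```

With a real driver, call `scroll_to_end(driver)` first and then run the `for row in driver.find_elements_by_xpath(...)` loop a single time, so each quote is printed once. A longer pause than 0.5 seconds may be needed on a slow connection.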

# Example: Tripadvisor

## Get Reviews per User

In [7]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver

driver = webdriver.Firefox()

url = 'https://www.tripadvisor.com/members/387piyalim'

driver.get(url)

next_button = driver.find_element_by_xpath("//li[@data-filter='REVIEWS_RESTAURANTS']")
next_button.click()

results = []

for review in driver.find_elements_by_xpath("//div[@class='cs-content-container']/ul/li"):
    name = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").text.encode('utf-8')
    url = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").get_attribute('href').encode('utf-8')
    reviewtitle = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").text.encode('utf-8')
    reviewurl = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").get_attribute('href').encode('utf-8')
    reviewdate = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-date']").text.encode('utf-8')
    rating = review.find_element_by_xpath("div[@class='cs-review-rating']/span").get_attribute('class').encode('utf-8')
    print(name, url, reviewtitle, reviewurl, reviewdate, rating)
    results.append([name, url, reviewtitle, reviewurl, reviewdate, rating])
    
len(results)

b'Dubai: That Place Cafe DXB' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d13173526-Reviews-That_Place_Cafe_DXB-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cInnovative & Must try Scrumptious Buns\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-d13173526-r619357598-That_Place_Cafe_DXB-Dubai_Emirate_of_Dubai.html' b'Sep 25, 2018' b'ui_bubble_rating bubble_4'
b'Dubai: Margherita' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d7745900-Reviews-Margherita-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cGood Italian food\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-d7745900-r609115071-Margherita-Dubai_Emirate_of_Dubai.html' b'Aug 22, 2018' b'ui_bubble_rating bubble_4'
b'Dubai: Karachi Haleem and Biryani' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d8739641-Reviews-Karachi_Haleem_and_Biryani-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cMy favorite Chicken Haleem\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-

b'Dubai: Jinja Asian Kitchen' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d4008345-Reviews-Jinja_Asian_Kitchen-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cA Melting Pot Of South East Asian Cuisine\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-d4008345-r215578516-Jinja_Asian_Kitchen-Dubai_Emirate_of_Dubai.html' b'Jul 15, 2014' b'ui_bubble_rating bubble_4'
b'Dubai: Mazina' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d3195374-Reviews-Mazina-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cIndulge in Friday Brunch at Mazina\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-d3195374-r202301243-Mazina-Dubai_Emirate_of_Dubai.html' b'Apr 22, 2014' b'ui_bubble_rating bubble_5'


24

## Get Reviews from Multiple User Profiles

In [10]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver
driver = webdriver.Firefox()

# multiple urls --> you can read multiple urls from csv file as well
urls = ['https://www.tripadvisor.com/members/387piyalim',
       'https://www.tripadvisor.com/members/CabanaBoyToronto'
      ]

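# create (or overwrite) the output CSV file; newline='' avoids blank rows on Windows
output = open('results.csv', 'wt', newline='', encoding='utf-8')
w = csv.writer(output)   # csv was imported in the first cell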

for url in urls:

    driver.get(url)

    next_button = driver.find_element_by_xpath("//li[@data-filter='REVIEWS_RESTAURANTS']")
    next_button.click()

    for review in driver.find_elements_by_xpath("//div[@class='cs-content-container']/ul/li"):
        name = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").text.encode('utf-8')
        url = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").get_attribute('href').encode('utf-8')
        reviewtitle = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").text.encode('utf-8')
        reviewurl = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").get_attribute('href').encode('utf-8')
        reviewdate = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-date']").text.encode('utf-8')
        rating = review.find_element_by_xpath("div[@class='cs-review-rating']/span").get_attribute('class').encode('utf-8')
        print(name, url, reviewtitle, reviewurl, reviewdate, rating)
        w.writerow([name, url, reviewtitle, reviewurl, reviewdate, rating])
    
output.close()

b'Dubai: That Place Cafe DXB' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d13173526-Reviews-That_Place_Cafe_DXB-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cInnovative & Must try Scrumptious Buns\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-d13173526-r619357598-That_Place_Cafe_DXB-Dubai_Emirate_of_Dubai.html' b'Sep 25, 2018' b'ui_bubble_rating bubble_4'
b'Dubai: Margherita' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d7745900-Reviews-Margherita-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cGood Italian food\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-d7745900-r609115071-Margherita-Dubai_Emirate_of_Dubai.html' b'Aug 22, 2018' b'ui_bubble_rating bubble_4'
b'Dubai: Karachi Haleem and Biryani' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d8739641-Reviews-Karachi_Haleem_and_Biryani-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cMy favorite Chicken Haleem\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-

b'Dubai: Mazina' b'https://www.tripadvisor.com/Restaurant_Review-g295424-d3195374-Reviews-Mazina-Dubai_Emirate_of_Dubai.html' b'\xe2\x80\x9cIndulge in Friday Brunch at Mazina\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g295424-d3195374-r202301243-Mazina-Dubai_Emirate_of_Dubai.html' b'Apr 22, 2014' b'ui_bubble_rating bubble_5'
b"New York City: Junior's Restaurant & Cheesecake" b'https://www.tripadvisor.com/Restaurant_Review-g60763-d14928509-Reviews-Junior_s_Restaurant_Cheesecake-New_York_City_New_York.html' b'\xe2\x80\x9cDelicious and Plentiful Breakfast\xe2\x80\x9d' b'https://www.tripadvisor.com/ShowUserReviews-g60763-d14928509-r613931846-Junior_s_Restaurant_Cheesecake-New_York_City_New_York.html' b'Sep 4, 2018' b'ui_bubble_rating bubble_5'
b'Woodbridge: Pizzeria Gelato Gelato' b'https://www.tripadvisor.com/Restaurant_Review-g562671-d12870734-Reviews-Pizzeria_Gelato_Gelato-Woodbridge_Vaughan_Ontario.html' b'\xe2\x80\x9cBest Italian Ice Cream in Woodbridge\xe2\x80\x9d' b'

In [11]:
df = pd.read_csv('results.csv', header=None)
df

Unnamed: 0,0,1,2,3,4,5
0,b'Dubai: That Place Cafe DXB',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cInnovative & Must try Scrumptiou...,b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Sep 25, 2018'",b'ui_bubble_rating bubble_4'
1,b'Dubai: Margherita',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cGood Italian food\xe2\x80\x9d',b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 22, 2018'",b'ui_bubble_rating bubble_4'
2,b'Dubai: Karachi Haleem and Biryani',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cMy favorite Chicken Haleem\xe2\x...,b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 20, 2018'",b'ui_bubble_rating bubble_4'
3,b'Dubai: Pappa Roti',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cI am totally bonkers over their ...,b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 19, 2018'",b'ui_bubble_rating bubble_5'
4,b'Dubai: Fistikzade Cafe',b'https://www.tripadvisor.com/Restaurant_Revie...,"b'\xe2\x80\x9cDelectable Turkish Baklava, not ...",b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 17, 2018'",b'ui_bubble_rating bubble_5'
5,b'Dubai: Fish Hut',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cA must visit for sea food lovers...,b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 15, 2018'",b'ui_bubble_rating bubble_4'
6,b'Dubai: My Shawarma',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cSmall funky restaurant serving s...,b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 10, 2018'",b'ui_bubble_rating bubble_4'
7,"b""Dubai: Asha's""",b'https://www.tripadvisor.com/Restaurant_Revie...,"b""\xe2\x80\x9cServing Indian Cuisine in it's f...",b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 7, 2018'",b'ui_bubble_rating bubble_4'
8,b'Dubai: Salt',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cA food truck serving delumptious...,b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 6, 2018'",b'ui_bubble_rating bubble_5'
9,b'Dubai: MTR - Mavalli Tiffin Rooms',b'https://www.tripadvisor.com/Restaurant_Revie...,b'\xe2\x80\x9cBest Rava Idli And Ragi Dosa in ...,b'https://www.tripadvisor.com/ShowUserReviews-...,"b'Aug 6, 2018'",b'ui_bubble_rating bubble_4'
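
Every cell in the table above is wrapped in `b'...'` and carries escape sequences such as `\xe2\x80\x9c`. This is what Python 3 produces when UTF-8 encoded `bytes` objects, rather than `str`, are passed to `csv.writer`: the writer stores their `repr`. Below is a minimal sketch of undoing this after the fact, assuming each cell is a complete bytes literal. `decode_bytes_literal` is a hypothetical helper name, not part of pandas.

```python
import ast

def decode_bytes_literal(cell):
    """Turn a string such as "b'Dubai: Salt'" back into readable text.

    Cells that are not bytes literals are returned unchanged.
    """
    if isinstance(cell, str) and cell[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return cell  # truncated or malformed literal: leave it alone
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
    return cell

# Two values copied from the table above
print(decode_bytes_literal("b'Dubai: Salt'"))
print(decode_bytes_literal(r"b'\xe2\x80\x9cGood Italian food\xe2\x80\x9d'"))
```

With pandas, the helper could be applied column by column, for example `df[0].map(decode_bytes_literal)`. The cleaner fix is to write `str` values to the CSV file in the first place, so that no repair is needed.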


## User Profile

In [62]:
# https://stackoverflow.com/questions/14068119/python-web-crawling

from io import StringIO
import requests
from lxml import etree

response = requests.get("http://www.tripadvisor.in/members/SomersetKeithers")

parser = etree.HTMLParser()
tree = etree.parse(StringIO(response.text), parser)

def get_definition_description(tree, term):
    # first <dd> text under the <dl> whose <dt> equals `term`; None if nothing matches
    description = tree.xpath("//dl[dt/text()='%s']//dd/text()" % term)
    if len(description):
        return description[0].strip()

print(get_definition_description(tree, "ageSince:"))
print(get_definition_description(tree, "Gender:"))
print(get_definition_description(tree, "Location:"))

None
None
None


In [60]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver

driver = webdriver.Firefox()

url = 'https://www.tripadvisor.com/members/387piyalim'

driver.get(url)

for review in driver.find_elements_by_xpath("//div[@id='MODULES_MEMBER_CENTER']"):
    # user name, then the "Since ..." membership line
    print(review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profileBlock']/div/div/span").text)
    print(review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profInfo']/div/p").text)

Piyali M
Since Apr 2014


In [61]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver

driver = webdriver.Firefox()

urls = ['https://www.tripadvisor.com/members/387piyalim',
       'https://www.tripadvisor.com/members/CabanaBoyToronto'
      ]

# create (or overwrite) the csv file; csv.writer needs text mode with newline='' in Python 3
with open('profiles.csv', 'w', newline='', encoding='utf-8') as output:
    w = csv.writer(output)

    for url in urls:

        driver.get(url)

        for review in driver.find_elements_by_xpath("//div[@id='MODULES_MEMBER_CENTER']"):
            uid = review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profileBlock']/div/div/span").text
            since = review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profInfo']/div/p").text
            w.writerow([uid, since])
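
When scraped text is written to CSV under Python 3, `csv.writer` expects a file opened in text mode with `newline=''`, and rows made of `str` rather than encoded bytes. Below is a minimal sketch of that pattern with no browser involved. `save_profiles`, `load_profiles`, the column names and the file name `profiles_demo.csv` are illustrative assumptions; the sample row reuses the values printed earlier.

```python
import csv

def save_profiles(rows, path):
    """Write (user, member_since) rows to a UTF-8 CSV file."""
    # Text mode plus newline='' is the combination the csv module documents
    with open(path, "w", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(["user", "member_since"])  # hypothetical column names
        writer.writerows(rows)

def load_profiles(path):
    """Read the file back as a list of rows, header included."""
    with open(path, newline="", encoding="utf-8") as source:
        return list(csv.reader(source))

save_profiles([("Piyali M", "Since Apr 2014")], "profiles_demo.csv")
print(load_profiles("profiles_demo.csv"))
```

Using `with` closes the file even if a page fails to load halfway through the loop, so rows already written are not lost.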

# Lab:

**http://www.horsedeathwatch.com/** (Another Javascript-rendered Website)

Collect three columns: 
- horse name
- date
- course

In [6]:
import requests
from lxml import html

#storing response
response = requests.get('http://www.horsedeathwatch.com/')
data = html.fromstring(response.text)

print(data.xpath('//tr/td[@data-th="Horse"]/a/text()'))

[]


No data is returned because the table is filled in by JavaScript after the page loads, so **Selenium** is needed.
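
Because the rows are added by JavaScript, they may not exist at the moment the browser finishes opening the page, which is why the Selenium cell that follows pauses with `time.sleep(4)`. A fixed pause is either too long or too short. The sketch below polls a condition until it holds or a timeout passes. `wait_until` and `fake_rows` are hypothetical names used only for illustration; Selenium ships its own `WebDriverWait` for the same purpose, and this version only shows the idea without needing a browser.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)

# Stand-in for a page whose rows only show up on the third check
attempts = {"count": 0}

def fake_rows():
    attempts["count"] += 1
    return ["row"] if attempts["count"] >= 3 else []

rows = wait_until(fake_rows, timeout=5, interval=0.01)
print(rows, attempts["count"])
```

With a real driver, the condition could be `lambda: driver.find_elements_by_xpath("//tbody/tr")`, which returns an empty list until the rows have been rendered.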

In [51]:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://www.horsedeathwatch.com/")

# the rows of interest sit under this XPath: //tbody/tr

#just wait a little for the browser to be ready
time.sleep(4)

data = []
for i in driver.find_elements_by_xpath("//tbody/tr"):
    name = i.find_element_by_xpath("td[1]/a").text
    date = i.find_element_by_xpath("td[2]").text
    course = i.find_element_by_xpath("td[3]").text
    #print(name, date, course)
    data.append([name, date, course])

df = pd.DataFrame(data)
df

Dodgybingo (IRE) 26 Sep, 2018 Perth
Bullingdon 25 Sep, 2018 Chelmsford
Solatentif (FR) 23 Sep, 2018 Plumpton
Enzos Lad (IRE) 18 Sep, 2018 Kempton AW
Its Pandorama (IRE) 17 Sep, 2018 Hexham
Tilsworth Gold 17 Sep, 2018 Kempton AW
Dorcas 15 Sep, 2018 Chelmsford
Newgate Sioux 15 Sep, 2018 Musselburgh Flat
Commanding Officer 14 Sep, 2018 Doncaster Flat
Sunday In The Park 11 Sep, 2018 Worcester
Walk Waterford 28 Aug, 2018 Stratford
Eyecatcher (IRE) 25 Aug, 2018 Redcar
Smiling Jessica (IRE) 25 Aug, 2018 Cartmel
Volevo Lui 25 Aug, 2018 Chelmsford
Ocean Jive 22 Aug, 2018 Worcester


Unnamed: 0,0,1,2
0,Dodgybingo (IRE),"26 Sep, 2018",Perth
1,Bullingdon,"25 Sep, 2018",Chelmsford
2,Solatentif (FR),"23 Sep, 2018",Plumpton
3,Enzos Lad (IRE),"18 Sep, 2018",Kempton AW
4,Its Pandorama (IRE),"17 Sep, 2018",Hexham
5,Tilsworth Gold,"17 Sep, 2018",Kempton AW
6,Dorcas,"15 Sep, 2018",Chelmsford
7,Newgate Sioux,"15 Sep, 2018",Musselburgh Flat
8,Commanding Officer,"14 Sep, 2018",Doncaster Flat
9,Sunday In The Park,"11 Sep, 2018",Worcester
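
The collected rows keep the date as text such as `26 Sep, 2018`, and the DataFrame columns are unnamed. Below is a minimal sketch of turning the text into real dates and attaching the three column names the lab asks for. `parse_race_date` is a hypothetical helper, and the format string assumes English month abbreviations (the default locale).

```python
from datetime import datetime

def parse_race_date(text):
    """Convert a string like '26 Sep, 2018' into a datetime.date."""
    return datetime.strptime(text, "%d %b, %Y").date()

# First three rows of the output above
rows = [
    ["Dodgybingo (IRE)", "26 Sep, 2018", "Perth"],
    ["Bullingdon", "25 Sep, 2018", "Chelmsford"],
    ["Solatentif (FR)", "23 Sep, 2018", "Plumpton"],
]

records = [
    {"horse": name, "date": parse_race_date(date), "course": course}
    for name, date, course in rows
]
for record in records:
    print(record["date"].isoformat(), record["horse"], record["course"])
```

Passing `records` to `pd.DataFrame` would give a table with `horse`, `date` and `course` as column names instead of `0`, `1` and `2`.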


# References

- http://selenium-python.readthedocs.io/
- https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python

# Appendix: Pagination & Downloading images

In [13]:
# https://hackernoon.com/30-minute-python-web-scraper-39d6d038e5da

import requests
import time
from selenium import webdriver
from PIL import Image
from io import BytesIO

url = "https://unsplash.com"

driver = webdriver.Firefox()
#driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)

driver.execute_script("window.scrollTo(0,1000);")
time.sleep(5)
image_elements = driver.find_elements_by_css_selector("#gridMulti img")
i = 0

for image_element in image_elements:
    image_url = image_element.get_attribute("src")
    # Send an HTTP GET request, get and save the image from the response
    image_object = requests.get(image_url)
    image = Image.open(BytesIO(image_object.content))
    image.save("//Downloads/download_images/image" + str(i) + "." + image.format, image.format)
    i += 1

FileNotFoundError: [Errno 2] No such file or directory: '//Downloads/download_images/image0.JPEG'
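
The `FileNotFoundError` above most likely means that the folder `//Downloads/download_images/` does not exist: `image.save` writes the file but does not create missing parent directories. Below is a minimal sketch of preparing the folder and building each file name with `pathlib`. The folder `download_images` under the current working directory and the helper `image_path` are assumptions for illustration.

```python
from pathlib import Path

def image_path(folder, index, image_format):
    """Create `folder` if needed and return the path for image number `index`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    # Pillow reports formats such as 'JPEG' or 'PNG'; reuse them as the extension
    return folder / "image{}.{}".format(index, image_format.lower())

target = image_path("download_images", 0, "JPEG")
print(target)
```

Inside the download loop, the save call could then read `image.save(image_path("download_images", i, image.format), image.format)`.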