# Webscraping 101 (30 minute read)

## Goals of this tutorial:
- How to Scrape Data.
- Installing and Using Selenium.
- How to use Selenium to dynamically interact with websites.
- How to scrape useful information with BeautifulSoup.
- How to moderate your usage and anticipate potential future issues.


For this tutorial, I'll be running through a couple scripts I wrote to scrape hotel reviews off TripAdvisor.<br>
I'll include those as .py files and would recommend running them in Spyder/some IDE. <br>
Usually some unanticipated errors will occur midway through a session, and it's useful to rerun sections of a script, instead of trying to recall everything again from the command line.<br>

<b>Disclaimer: </b>A lot of the methods and advice here are trial and error. Some of the potential issues at the end of this tutorial are ones I haven't encountered yet, but I've seen pop up in other tutorials. Feel more than welcome to add or correct aspects of this tutorial.

## 1. How to Scrape Data

If data is not available through a vendor or an API but is publicly available on the internet, it can very likely be scraped with minimal effort.

Here's a rough checklist of things to consider for a dataset spread between multiple pages on a site:

1. Is there a pattern for how each instance of the data is stored/located on each page?<br>
2. Is there a pattern for the URL of each page that corresponds to each instance?<br>
3. Can I sample enough of the data in less than 100,000 page accesses?<br>


<b>Ideally</b> all these answers would be yes. It would be great if the URLs have a pattern so that they could be easily generated by a function.<br>
For instance, for an old WeatherUnderground scraper, historical weather data could be accessed by substituting the day, month, and year into the URL format.<br>

Simple example: try visiting<br>
https://www.wunderground.com/history/airport/KSFO/2017/1/1/DailyHistory.html<br>
https://www.wunderground.com/history/airport/KSFO/2016/1/1/DailyHistory.html
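
As a rough sketch of that idea, the snippet below fills dates into the URL format of the two links above. The helper names `wunderground_url` and `daily_urls` are illustrative, and the URL format is copied from the examples, so it may no longer be served by the site.

```python
from datetime import date, timedelta

# URL template copied from the two example links above; the site may have changed since.
URL_TEMPLATE = (
    "https://www.wunderground.com/history/airport/"
    "{airport}/{year}/{month}/{day}/DailyHistory.html"
)


def wunderground_url(day, airport="KSFO"):
    """Fill the date and airport code into the URL pattern (hypothetical helper)."""
    return URL_TEMPLATE.format(
        airport=airport, year=day.year, month=day.month, day=day.day
    )


def daily_urls(start, end, airport="KSFO"):
    """Return one URL per calendar day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [wunderground_url(start + timedelta(days=i), airport) for i in range(n_days)]


urls = daily_urls(date(2017, 1, 1), date(2017, 1, 3))
print(urls[0])
```

Each generated URL can then be passed to whatever function downloads and parses a single page.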

Most sites I've seen are not like this; they use <b>unique</b> identifiers to refer to distinct locations, dates, geographies, etc.<br>


For instance, with TripAdvisor try these three URLs:<br>
https://www.tripadvisor.com/Hotels-g60898-Atlanta_Georgia-Hotels.html<br>
https://www.tripadvisor.com/Hotels-g60898-New_York_New_York-Hotels.html<br>
https://www.tripadvisor.com/Hotel_Review-g60898-d244079-Reviews-Motel_6_Atlanta-Atlanta_Georgia.html


Trying to modify the first URL to point at a new location (as in the second URL) actually redirects the browser back to the original location (Atlanta).<br>
This is likely because the <b>g60898</b> part of the URL is a unique identifier for Atlanta.<br>
Similarly, the <b>d244079</b> is likely a unique key for this particular hotel, so there's no clear way to scrape only the hotels or locations we're interested in without literally scraping the entire site.
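
To make the identifier idea concrete, here is a small sketch that pulls the two keys out of a URL with a regular expression. The pattern is an assumption inferred from the example URLs above (a `-g<digits>` location key, optionally followed by a `-d<digits>` hotel key), not something documented by TripAdvisor, and `extract_ids` is a hypothetical helper.

```python
import re

# Assumed pattern: "-g<digits>" is the location key and "-d<digits>" the hotel key,
# as inferred from the example URLs above (not from any official documentation).
ID_PATTERN = re.compile(r"-g(?P<geo>\d+)(?:-d(?P<hotel>\d+))?-")


def extract_ids(url):
    """Return (geo_id, hotel_id); hotel_id is None for a location listing page."""
    match = ID_PATTERN.search(url)
    if match is None:
        return None, None
    return match.group("geo"), match.group("hotel")


listing = "https://www.tripadvisor.com/Hotels-g60898-Atlanta_Georgia-Hotels.html"
review = (
    "https://www.tripadvisor.com/Hotel_Review-g60898-d244079-Reviews-"
    "Motel_6_Atlanta-Atlanta_Georgia.html"
)
print(extract_ids(listing))
print(extract_ids(review))
```

Storing these keys next to each scraped record makes it easy to deduplicate hotels later, since the readable part of the URL can apparently be changed without changing the page that is served.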





### <b>This is not the end of the world.</b> <br>

So long as there is a way to interact with the website automatically, the site can be crawled and the URLs can be stored and subsequently scraped.<br>
To do this, we need to dynamically interact with the website (using <b>Selenium</b>).



## 2. Installing and Testing Selenium for Chrome on Mac 

Simply:

1. pip install selenium<br>
2. brew install chromedriver (on recent Homebrew versions: brew install --cask chromedriver)<br>
3. Add chromedriver to your path by following the instructions __[here](http://www.kenst.com/2015/03/installing-chromedriver-on-mac-osx/)__

Then try the following code:

In [23]:
import selenium
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.sparkbeyond.com')

You should see a new Chrome window (screenshot below) that opens up SparkBeyond's website. 

If you get the error <b>ChromeDriver executable needs to be available in the path</b>, try working through the instructions on the above site again. If that doesn't work, ping me on Slack. 

## 3. How to use Selenium to Dynamically Interact with Websites

Let's continue with that TripAdvisor example. If we want all the Motel 6 locations in a certain state, we can perform a search query through the website and then crawl through the results to store all relevant URLs. Let's practice doing this in Selenium, because eventually we have to loop through all 50 states. <b>(We're searching by State because any query is capped after 990 Hotels, so the results from an entire United States search query are incomplete.)</b>
<br>

To do this, we'll iteratively:
1. Think about how we would interact with a page.
2. Right-click (or Control-click) to <b>inspect</b> the element we would interact with in the page source code.
3. Code the action in Selenium.
4. Run the code and validate the action was performed in the browser.<br>


In [24]:
driver.get('https://www.tripadvisor.com')

From the page source below, it looks like we have to click on this element to open a search page.
![test](TA_Images/TripAdvisorSearch.png "Title")


In [26]:
from selenium.webdriver.common.by import By  # Selenium 4 removed the find_element_by_* helpers
button = driver.find_element(By.CLASS_NAME, "mag_glass_parent")
button.click()

Now we'll type in queries to the two new fields that appeared.

<b>Note: sometimes elements are not present until a certain action is performed on a webpage. Selenium has a wait function that holds until the element is present before continuing, but I've found it's generally easier to just use a sleep timer.</b>
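
For readers who prefer waiting on a condition rather than sleeping for a fixed time, here is a minimal sketch of the polling idea using only the standard library. Selenium ships its own version of this (`WebDriverWait` in `selenium.webdriver.support.ui`); the `wait_until` helper and the fake element check below are illustrative stand-ins so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that truthy value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Stand-in for a page element that only "appears" on the third check.
attempts = {"count": 0}


def fake_element_present():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(fake_element_present, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

With a live driver, the condition could be something like `lambda: driver.find_elements(By.ID, 'mainSearch')` (with `By` imported from `selenium.webdriver.common.by`), which returns an empty list until the element exists.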

![test](TA_Images/TripAdvisorInputField.png "Title")


In [27]:
import time
import numpy as np
from selenium.webdriver.common.by import By

hotel = 'Motel 6'
state = 'Texas'

# Pause for a random interval so the search fields have time to appear
time.sleep(np.random.lognormal(0,0.5) + 0.2)

# Type the hotel name
search_field = driver.find_element(By.ID, 'mainSearch')
search_field.send_keys(hotel)
time.sleep(np.random.lognormal(0,0.5) + 0.2)

# Type the state name
location_field = driver.find_element(By.ID, 'GEO_SCOPED_SEARCH_INPUT')
location_field.send_keys(state)

You should see both fields filled in within your browser, as seen below:

![test](TA_Images/TripAdvisorInputFieldDone.png "Title")


In [28]:
from selenium.webdriver.common.by import By  # harmless if already imported
search_button = driver.find_element(By.ID, 'SEARCH_BUTTON_CONTENT')
search_button.click()

## 4. How to Scrape Useful Information with BeautifulSoup

So now we've made it to the search results page:

![test](TA_Images/TripAdvisorSearchResults.png "Title")

<br>
It's time to look into the page source and find the URLs that point to pages that contain information about each individual hotel.
<br>

We're looking for a <b>unique</b> tag that contains the URL for each hotel. We can see on the right of the screenshot that each hotel corresponds to a `div` tag with class `title`, whose `onclick` attribute includes the URL to the review page. Using BeautifulSoup, we can easily parse the page source for all instances of this tag, and then parse each individual one.

In [36]:
from bs4 import BeautifulSoup
import re

html = driver.page_source
soup = BeautifulSoup(html, 'lxml') 
hotel_locations = soup.find_all('div', {'class' : 'title'})

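hotel_urls = []
for location in hotel_locations:
    # The relative URL is the last comma-separated piece of the onclick attribute;
    # strip whitespace and quotes, then drop the last 10 characters.
    hotel_urls.append("https://www.tripadvisor.com" + location.get('onclick').split(',')[-1].strip().strip("'")[:-10])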

hotel_urls[0:4]

['https://www.tripadvisor.com/Hotel_Review-g55856-d244353-Reviews-Motel_6_Ft_Stockton-Fort_Stockton_Texas.html',
 'https://www.tripadvisor.com/Hotel_Review-g56056-d1177217-Reviews-Motel_6_Junction-Junction_Texas.html',
 'https://www.tripadvisor.com/Hotel_Review-g55505-d2555911-Reviews-Motel_6_Boerne-Boerne_Texas.html',
 'https://www.tripadvisor.com/Hotel_Review-g55863-d1528531-Reviews-Motel_6_Fredericksburg-Fredericksburg_Texas.html']

### Important Note
Webpages change all the time. For instance, a week before I wrote this notebook, the tag for hotel pages was:<br>

hotel_locations = soup.find_all('a', href = re.compile('/Hotel_Review'))<br>

This may change again by the time you view this notebook.<br>


### Scraping JSON and HTML Information

Let's continue and scrape all potentially relevant data from the hotel page. It's important to take a close look at the page source, because sometimes a lot of information is in a <b>really nice JSON dictionary that is very easy to parse.</b>
<br>
Other information can be scraped just as easily by identifying a unique HTML tag for that particular field, like "div class = highlightedAmenity detailListItem", and then using .find_all to grab all instances of that information. 
<br><b>BeautifulSoup</b> has several attributes (like .next, .string) that can be chained to eventually get to the information you need. It'll look a bit ugly, but sites sometimes update frequently, so quick and dirty might be the way to go.


![test](TA_Images/TripAdvisorJSONDict.png "Title")

In [44]:
import json
from collections import OrderedDict

hotel_page = np.random.choice(hotel_urls)
driver.get(hotel_page)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

review_content = soup.find('script', type='application/ld+json')
review_dict = json.loads(review_content.text)

scrapped_amenities = soup.find_all('div', {'class': 'highlightedAmenity detailListItem'})
hotel_amenities = []
for amenity in scrapped_amenities:
    hotel_amenities.append(amenity.next)

review_topics = soup.find_all('span', {'class': 'ui_tagcloud fl'})
hotel_topics = []
for topic in review_topics:
    hotel_topics.append(topic.next)

hotel_info = OrderedDict(
                 {'city'            : review_dict['address']['addressLocality'],
                  'state'           : review_dict['address']['addressRegion'],
                  'zip_code'        : review_dict['address']['postalCode'],
                  'address'         : review_dict['address']['streetAddress'],
                  'name'            : review_dict['name'],
                  'price_range'     : review_dict['priceRange'],
                  'rating_value'    : review_dict['aggregateRating']['ratingValue'],
                  'review_count'    : review_dict['aggregateRating']['reviewCount'],
                  'review_topics'   : hotel_topics,
                  'hotel_amenities' : hotel_amenities,
                  })

hotel_info

OrderedDict([('city', 'Abilene'),
             ('state', 'Texas'),
             ('zip_code', '79603-2305'),
             ('address', '4951 W Stamford St'),
             ('name', 'Motel 6 Abilene'),
             ('price_range',
              '$48 - $59 (Based on Average Rates for a Standard Room)'),
             ('rating_value', '3.0'),
             ('review_count', '66'),
             ('review_topics', []),
             ('hotel_amenities',
              ['Wifi',
               'Free Parking',
               'Air Conditioning',
               'Pool',
               'Non-Smoking Rooms'])])
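
Because pages change over time, any of the keys used above may be missing on a given page, in which case the direct lookups raise a `KeyError`. A minimal sketch of a more forgiving lookup follows. `get_nested` is a hypothetical helper, and the sample dictionary only mirrors the structure shown above with one field deliberately left out.

```python
def get_nested(data, *keys, default=None):
    """Walk nested dictionaries key by key, returning `default` if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Sample shaped like the JSON dictionary above, with 'aggregateRating' left out on purpose.
review_dict = {
    "name": "Motel 6 Abilene",
    "address": {"addressLocality": "Abilene", "addressRegion": "Texas"},
}

city = get_nested(review_dict, "address", "addressLocality")
rating = get_nested(review_dict, "aggregateRating", "ratingValue")
print(city, rating)
```

Missing fields then show up as `None` in the output row instead of stopping a long scraping run partway through.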

### Appending to a File

We now have a nice ordered dictionary that can easily be appended to a file containing the scraped data using <b>csv.DictWriter</b>.
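
A small sketch of that append step is below. The `append_row` helper, the temporary output path, and the choice to join list values with `|` are assumptions made for illustration. The sample values come from the output above.

```python
import csv
import os
import tempfile
from collections import OrderedDict


def append_row(path, row):
    """Append one dict as a CSV row, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row.keys()))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demo with a temporary file; list values are joined so each field stays a single cell.
hotel_info = OrderedDict([
    ("name", "Motel 6 Abilene"),
    ("city", "Abilene"),
    ("hotel_amenities", "|".join(["Wifi", "Free Parking"])),
])
out_path = os.path.join(tempfile.mkdtemp(), "hotels.csv")
append_row(out_path, hotel_info)
append_row(out_path, hotel_info)

with open(out_path, newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
print(len(rows), rows[0]["name"])
```

Opening the file in append mode for every row is slower than keeping it open, but it means each finished page is on disk before the next request is made.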

## 5. Moderating Use and Potential Future Issues

### Moderating Use
Eventually a site may block repeated requests from the same user. To reduce the chance of this, I would recommend:
1. Sleep timers drawn from a lognormal distribution between each action (something like np.random.lognormal(1,0.5) + 2 is a good starting point).
2. Using a VPN - there are many free private VPN services (such as [this one](https://www.tunnelbear.com/)).
3. Writing to a file after each page scrape to prevent losing significant amounts of data if there's an error or disconnection.
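
To get a feel for what the suggested timer costs in wall-clock time, here is a sketch using only the standard library (`random.lognormvariate` in place of NumPy). The helper names are illustrative, and the page count is borrowed from the checklist in section 1, assuming one timed action per page.

```python
import math
import random
import time


def polite_delay(mu=1.0, sigma=0.5, floor=2.0):
    """Draw a delay in seconds: a lognormal sample plus a fixed minimum."""
    return random.lognormvariate(mu, sigma) + floor


def expected_delay(mu=1.0, sigma=0.5, floor=2.0):
    """Mean of the delay above; a lognormal has mean exp(mu + sigma**2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2) + floor


def polite_sleep(mu=1.0, sigma=0.5, floor=2.0):
    """Sleep for a randomised delay between actions."""
    time.sleep(polite_delay(mu, sigma, floor))


mean_seconds = expected_delay()
pages = 100000
print("mean delay: %.2f s" % mean_seconds)
print("days for %d pages: %.1f" % (pages, pages * mean_seconds / 86400))
```

With these parameters the mean delay is about five seconds, so 100,000 page loads at one action each would take roughly six days of continuous running.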

### Potential Future Issues
1. IP ban (I believe using a VPN, especially one with built-in IP cycling, will prevent this).
2. Sophisticated scraping detection tools (Distil Networks) that automatically detect Selenium - potential fix [here.](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver)
3. Complicated JavaScript - apparently sometimes it's much easier to access the mobile site and scrape from there.

## 6. Future Directions

Eventually I'll add the scripts, which should give you a good template to modify so you can write your own scraper within a couple of hours.<br>
An easy way to build a scraper is:<br>
- Identify endpoint URLs that contain relevant information
- Write a script to scrape a single endpoint URL
- Determine a search result page that will contain all endpoint URLs
- Write a script to crawl through the search page and scrape URLs
- Initialize a file and append URLs to the file
- Load the URLs into a dataframe and iteratively scrape information from each endpoint URL
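
The last two steps can be sketched as a loop that skips URLs already present in the results file, so a crashed run can be restarted without repeating work. This is only an outline under stated assumptions: it uses the `csv` module rather than a dataframe, `fake_scrape` stands in for the real Selenium and BeautifulSoup step, and the column names and example.com URLs are placeholders.

```python
import csv
import os
import tempfile


def load_done(path):
    """Return the set of URLs already written to the results file (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["url"] for row in csv.DictReader(handle)}


def scrape_all(urls, path, scrape_one, fieldnames=("url", "name")):
    """Scrape each URL not yet in the results file, appending one row per page."""
    done = load_done(path)
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fieldnames))
        if is_new:
            writer.writeheader()
        for url in urls:
            if url in done:
                continue
            writer.writerow(scrape_one(url))
            handle.flush()  # keep finished rows on disk if a later page fails
            done.add(url)


def fake_scrape(url):
    # Placeholder for the real Selenium + BeautifulSoup step.
    return {"url": url, "name": url.rsplit("/", 1)[-1]}


results_path = os.path.join(tempfile.mkdtemp(), "results.csv")
scrape_all(["https://example.com/a", "https://example.com/b"], results_path, fake_scrape)
# A second run only scrapes the new URL; the first two are skipped.
scrape_all(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    results_path,
    fake_scrape,
)
print(sorted(load_done(results_path)))
```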