# Scraping Livingsocial data

The purpose of this exercise is to demonstrate some basic web scraping practices using the Python programming language. To assist with this exercise we are going to use two third-party libraries: an HTTP library called [Requests](http://docs.python-requests.org/en/master/) and a web scraping library called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) ([documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

In [None]:
# import 3rd party libraries for fetching and parsing HTML documents 
from bs4 import BeautifulSoup
import requests

This tutorial will scrape search results from [Livingsocial](https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh). Specifically, we are interested in collecting all of the information about deals in Pittsburgh in a tabular format.

- The base URL is: https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh


Why are we scraping Livingsocial?

http://monocle.livingsocial.com/

In [None]:
# put the base URL for the web scrape into a variable called "entrypoint"
entrypoint = "https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh"

In [None]:
# fetch the web page containing the Livingsocial deals
response = requests.get(entrypoint) 

In [None]:
response.text

In [None]:
# parse the HTML document with Beautiful Soup 
search_results_page = BeautifulSoup(response.content, 'html.parser')


Now that we have *fetched* and *parsed* the HTML document, we can *extract* data.

What data do we want to extract? How about a list of all the deals!

Let's do an *inspect element* on the [listings page](https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh) and see what the HTML structure looks like.

![The Livingsocial deals page for Pittsburgh](livingsocial-listings.png)

If you look carefully you can see that each row in the list of deals is identified by a tag like this one:
`<li dealid="1558890" class="deal-tile facet-active search-result multiple-price-points" data-ga-data="" itemscope="" itemtype="http://schema.org/Offer">`

We can use the `li` tag and its `deal-tile` class to select only the information we want from the rest of the page.
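
To see how selecting by tag and CSS class behaves without touching the live site, here is a minimal sketch on a hand-written HTML fragment. The fragment, its deal IDs, and the variable names are made up for illustration; only the `li` tag and the `deal-tile` class come from the page above.

```python
from bs4 import BeautifulSoup

# A made-up fragment that mimics the structure of the listings page.
sample_html = """
<ul>
  <li dealid="1" class="deal-tile search-result">First deal</li>
  <li dealid="2" class="deal-tile facet-active">Second deal</li>
  <li class="advert">Not a deal</li>
</ul>
"""

sample_page = BeautifulSoup(sample_html, "html.parser")

# A string passed as the second argument filters on CSS class; an element
# matches if any one of its classes equals that string.
sample_deals = sample_page.find_all("li", "deal-tile")

print(len(sample_deals))
print([tile["dealid"] for tile in sample_deals])
```

Both deal rows match even though each carries a second class, which is why a single class name should be enough to pick out the deal tiles.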


  

In [None]:
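# select every <li> element that has the CSS class "deal-tile"
# (find_all is the current name for the older findAll alias)
deals = search_results_page.find_all("li", "deal-tile")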
len(deals)

We've extracted 20 deals from the first page of the search results. Now we need to extract the relevant information from the HTML structure. Here is what one of those elements looks like:

In [None]:
print(deals[0].prettify())

We can use Beautiful Soup's `find()` method to extract specific pieces of information from this HTML structure. For more information about `find()`, see the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find). Once we have the HTML tag of interest, we can get the data out of it.
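
As a rough illustration of what `find()` gives back, the sketch below runs it on a made-up deal tile. The HTML and the variable names are assumptions for demonstration; the `h2`/`h3` tags and `itemprop` values mirror the ones used in this tutorial.

```python
from bs4 import BeautifulSoup

# A made-up deal tile with a name but no seller.
sample_tile = BeautifulSoup(
    '<li class="deal-tile"><h2 itemprop="name"> Pizza for Two </h2></li>',
    "html.parser",
).find("li")

# find() returns the first matching Tag...
name_tag = sample_tile.find("h2", itemprop="name")
print(name_tag.text.strip())

# ...or None when nothing matches, so check before reading .text.
seller_tag = sample_tile.find("h3", itemprop="seller")
seller_text = seller_tag.text.strip() if seller_tag is not None else ""
print(repr(seller_text))
```

The `None` case matters on real pages, where not every deal is guaranteed to carry every field.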

In [None]:
deal = deals[0]
name = deal.find("h2", itemprop="name")
print(name)

In [None]:
type(name)

In [None]:
# oops we just want the text content, not the whole element
print(name.text)

Now we need to write some code that extracts all of the various bits of information from the HTML structure for each of the deals. Looking at the HTML we can see the name, seller, a description, a location, the URL for that specific deal, a price and something called the strikethrough price (to show the savings I guess). 

In [None]:
deal = deals[0]
deal_id = deal['dealid']
name = deal.find("h2", itemprop="name")
seller = deal.find("h3", itemprop="seller")
description = deal.find("p", "description")
location = deal.find("p", "location")
url = deal.find("a", "search-wrapper")
price = deal.find("div", "deal-price")
strikethrough_price = deal.find("div", "deal-strikethrough-price")

In [None]:
print(deal_id)

In [None]:
print(name.text)

In [None]:
print(seller.text)

In [None]:
print(description.text)

In [None]:
print(location.text)

In [None]:
print(url['href'])

In [None]:
print(price.text)

In [None]:
print(strikethrough_price.text)

At this point I want to show you what my screen looks like:

![The process of webscraping](desktop-view.png)

Great! Now that we know how to scrape the information from the page, it is time to assemble a "spider" that can "crawl" through multiple search pages.

We've currently scraped 20 deals, but we know by visiting the search page that there are a lot more. We need some code to automatically go to the next page of search results, scrape the deals listings, and repeat. 

We need to find the URL for the next page and then repeat the scraping process.

![HTML for the next button](next-button.png)

Looking at the HTML structure I can see it is very easy to find the next button because it has the CSS class `next_page`.
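
Here is a small sketch of pulling the link out of a next button and turning it into a full URL. The pagination markup is made up (only the `next_page` class comes from the real page), and it assumes the `href` is relative to the site root, as the crawl code later in this notebook also does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up pagination markup for illustration.
sample_nav = BeautifulSoup(
    '<div class="pagination">'
    '<a class="previous_page" href="/browse/cities/49/searches?page=1">Prev</a>'
    '<a class="next_page" href="/browse/cities/49/searches?page=3">Next</a>'
    '</div>',
    "html.parser",
)

next_link = sample_nav.find("a", "next_page")

# The href is relative, so resolve it against the site root before fetching it.
next_url = urljoin("https://www.livingsocial.com", next_link["href"])
print(next_url)
```

`urljoin` is one option here; plain string concatenation works too as long as the `href` always starts with a slash.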

In [None]:
next_button = search_results_page.find("a", "next_page")

print(next_button['href'])

Sweet! This is all the information I need to build a spider/crawler/scraper to automate the process.
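
Before pointing a crawler at the live site, it can help to rehearse the control flow offline. The sketch below follows `next_page` links through a made-up, in-memory "site" and stops when the button is missing or marked `disabled`. `FAKE_SITE`, `crawl_fake_site`, the URLs, and the `max_pages` cap are all hypothetical names introduced for this illustration, not part of the real page.

```python
from bs4 import BeautifulSoup

# Made-up pages keyed by URL, standing in for responses from the live site.
FAKE_SITE = {
    "/search?page=1": '<li class="deal-tile">A</li><li class="deal-tile">B</li>'
                      '<a class="next_page" href="/search?page=2">Next</a>',
    "/search?page=2": '<li class="deal-tile">C</li>'
                      '<a class="next_page disabled" href="#">Next</a>',
}


def crawl_fake_site(start_url, max_pages=10):
    """Follow next_page links through FAKE_SITE and collect the deal text."""
    collected = []
    url = start_url
    for _ in range(max_pages):  # hard cap so a markup change cannot loop forever
        page = BeautifulSoup(FAKE_SITE[url], "html.parser")
        collected.extend(tile.text for tile in page.find_all("li", "deal-tile"))

        next_button = page.find("a", "next_page")
        # Stop when there is no next button or it is marked as disabled.
        if next_button is None or "disabled" in next_button.get("class", []):
            break
        url = next_button["href"]
    return collected


print(crawl_fake_site("/search?page=1"))
```

The page cap is a design choice rather than a requirement: it trades completeness for the guarantee that the loop always ends.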

In the cells below we can assemble the code from the exploratory analysis to automate the web scraping process.
The first cell below defines a function for extracting data from the HTML structure of a deal. The cells after it try that function on a single deal and then run the crawl across all of the search result pages.

In [None]:
def extract_deal_data(deal):
    """This function takes the raw deal HTML and
    extracts eight data points into a Python dictionary."""

    # find() returns None when a tag is missing, so reading .text or ['href']
    # raises AttributeError or TypeError; a missing attribute raises KeyError.
    # Catching only these avoids hiding unrelated errors behind a bare except.
    missing = (AttributeError, KeyError, TypeError)

    data = {}
    try:
        data['id'] = deal['dealid']
    except missing:
        data['id'] = ""
    try:
        data['name'] = deal.find("h2", itemprop="name").text
    except missing:
        data['name'] = ""
    try:
        data['seller'] = deal.find("h3", itemprop="seller").text
    except missing:
        data['seller'] = ""
    try:
        data['description'] = deal.find("p", "description").text
    except missing:
        data['description'] = ""
    try:
        data['location'] = deal.find("p", "location").text
    except missing:
        data['location'] = ""
    try:
        data['url'] = deal.find("a", "search-wrapper")['href']
    except missing:
        data['url'] = ""
    try:
        data['price'] = deal.find("div", "deal-price").text
    except missing:
        data['price'] = ""
    try:
        data['strikethrough-price'] = deal.find("div", "deal-strikethrough-price").text
    except missing:
        data['strikethrough-price'] = ""

    return data

In [None]:
extract_deal_data(deal)

In [None]:
# set some needed variables 
base = "https://www.livingsocial.com"
url = "https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh"

# create a global container
all_deals = []

# we are going to loop as long as this variable is true
crawl = True

print("Starting crawl.")

while crawl:
    
    # fetch the page, parse, and get the deals listing
    response = requests.get(url)
    search_results_page = BeautifulSoup(response.content, 'html.parser')
    raw_deals = search_results_page.find_all("li", "deal-tile")
    
    # save the results to a global container
    extracted_deals = [extract_deal_data(deal) for deal in raw_deals]
    all_deals.extend(extracted_deals)
    
    # print periodic crawl updates
    if len(all_deals) % 500 == 0:
        print("Collected %d results so far" % len(all_deals))
    
    # extract the Next button HTML element
    next_button = search_results_page.find("a", "next_page")
    
    # if there is no Next button, or its CSS class contains disabled, then we've reached the end.
    if next_button is None or 'disabled' in next_button.get('class', []):
        print("Reached the end of the search results. Found %s deals." % len(all_deals))
        
        # setting the crawl variable to false to break the while loop
        crawl = False
        break
    # set the next url to the contents of the next button
    url = base + next_button['href']

print("Crawl completed.")

In [None]:
len(all_deals)

In [None]:
# inspect the contents of one of the scraped deals (index 1000)
all_deals[1000]

In [None]:
from pandas import DataFrame

In [None]:
clean_data = DataFrame(all_deals)
clean_data.head()

In [None]:
clean_data.to_csv("scraped-data.csv")