# Scraping Data from Livingsocial 

The purpose of this exercise is to demonstrate some basic web scraping practices using the Python programming language. To assist with this exercise we are going to use two third-party libraries: an HTTP library called [Requests](http://docs.python-requests.org/en/master/) and a web scraping library called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) ([documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

In [3]:
# import 3rd party libraries for fetching and parsing HTML documents 
from bs4 import BeautifulSoup
import requests

This tutorial will scrape search results from [Livingsocial](https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh). Specifically, we are interested in collecting all of the information about deals in Pittsburgh in a tabular format.

- The base URL is: https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh


Why are we scraping Livingsocial?

http://monocle.livingsocial.com/

In [2]:
# put the base URL for the web scrape into a variable called "entrypoint"
entrypoint = "https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh"

In [3]:
# fetch the web page containing the Livingsocial deals
response = requests.get(entrypoint) 
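
Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one possible way to wrap the fetch. The helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Fetch a page and return its HTML, raising on HTTP errors.

    `fetch_html` is a hypothetical helper. `get` is injectable so the
    function can be tried without touching the network.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this in place, `fetch_html(entrypoint)` would return the same text as `response.text`, but a failed request would surface as an exception instead of an error page being parsed silently.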

In [4]:
response.text



In [5]:
# parse the HTML document with Beautiful Soup 
search_results_page = BeautifulSoup(response.content, 'html.parser')


Ok, now that we have *fetched* and *parsed* the HTML document, we can *extract* data.

What data do we want to extract? How about a list of all the deals!

Let's do an *inspect element* on the [listings page](https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh) and see what the HTML structure looks like.

![The Livingsocial deals page for Pittsburgh](livingsocial-listings.png)

If you look carefully, you can see that each deal is wrapped in a tag like this one:
`<li dealid="1558890" class="deal-tile facet-active search-result multiple-price-points" data-ga-data="" itemscope="" itemtype="http://schema.org/Offer">` 

This tag identifies each row in the list of deals. We can use it to select only the information we want from the rest of the page.


  

In [6]:
# find_all is the current name for findAll; the second argument matches the CSS class
deals = search_results_page.find_all("li", "deal-tile")
len(deals)

20

Ok, we've extracted 20 deals from the first page of the search results. Now we need to extract the relevant information from the HTML structure. Here is what one of those elements looks like:

In [7]:
print(deals[0].prettify())

<li class="deal-tile facet-active search-result" data-ga-data="" dealid="2006190">
 <a class="search-wrapper" href="https://www.livingsocial.com/events/2006190-gl-xeb-third-eye-blind-hard-rock-cafe?pos=0">
  <div class="deal-image">
   <div class="horizontal-img">
    <img alt='XEB plays "Third Eye Blind" on May 18 at 7:30 p.m.' src="https://a0.lscdn.net/imgs/58b493f1-98ac-4c29-8e65-19801218958e/340_q60.jpg">
    </img>
   </div>
   <div class="image-border">
   </div>
  </div>
  <div class="deal-details">
   <h2>
    XEB plays "Third Eye Blind" on May 18 at 7:30 p.m.
   </h2>
   <h3 class="">
    XEB
   </h3>
   <p class="description">
    The Deal

  $10 for one general admission ticket (up to $20.66 value)


XEB


  The Band: XEB is made up of Kevin Cadogan and Arion Salazar, both f...
   </p>
   <p class="location">
    Pittsburgh
   </p>
  </div>
  <div class="deal-prices">
   <div class="from">
    from
   </div>
   <div class="deal-strikethrough-price">
    <sup>
     $
    </su

We can use Beautiful Soup's `find()` function to extract specific pieces of information from this HTML structure. For more information about the find function, see the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find). Once we have the HTML tag of interest, we can pull the data we want out of it.
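
One detail worth knowing is that `find()` returns `None` when nothing matches, so chaining attribute access onto a missing element raises an error. Here is a minimal sketch on a made-up HTML snippet (the markup is a hypothetical stand-in, not the real Livingsocial page):

```python
from bs4 import BeautifulSoup

# hypothetical snippet imitating one deal tile
html = """
<li class="deal-tile" dealid="1">
  <div class="deal-details"><h2>Example deal</h2></div>
</li>
"""
tile = BeautifulSoup(html, "html.parser").find("li", "deal-tile")

# a matching element comes back as a Tag
heading = tile.find("div", "deal-details").h2
print(heading.text)

# a missing element comes back as None, so check before using it
location = tile.find("p", "location")
location_text = location.text if location is not None else ""
print(repr(location_text))
```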

In [8]:
deal = deals[0]
name = deal.find("div", "deal-details").h2
print(name)

<h2>XEB plays "Third Eye Blind" on May 18 at 7:30 p.m.</h2>


In [9]:
type(name)

bs4.element.Tag

In [10]:
# oops we just want the text content, not the whole element
print(name.text)

XEB plays "Third Eye Blind" on May 18 at 7:30 p.m.


Now we need to write some code that extracts all of the various bits of information from the HTML structure for each of the deals. Looking at the HTML we can see the name, seller, a description, a location, the URL for that specific deal, a price and something called the strikethrough price (to show the savings I guess). 

In [11]:
deal = deals[0]
deal_id = deal['dealid']
name = deal.find("div", "deal-details").h2
seller = deal.find("div", "deal-details").h3
description = deal.find("p", "description")
location = deal.find("p", "location")
url = deal.find("a", "search-wrapper")
price = deal.find("div", "deal-price")
strikethrough_price = deal.find("div", "deal-strikethrough-price")

In [12]:
print(deal_id)

2006190


In [13]:
print(name.text)

XEB plays "Third Eye Blind" on May 18 at 7:30 p.m.


In [14]:
print(seller.text)

XEB


In [15]:
print(description.text)

The Deal

  $10 for one general admission ticket (up to $20.66 value)


XEB


  The Band: XEB is made up of Kevin Cadogan and Arion Salazar, both f...


In [16]:
print(location.text)

Pittsburgh


In [17]:
print(url['href'])

https://www.livingsocial.com/events/2006190-gl-xeb-third-eye-blind-hard-rock-cafe?pos=0


In [18]:
print(price.text)

$10


In [19]:
print(strikethrough_price.text)

$20.66
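
The prices come back as strings such as `$10` and `$20.66`. To compute the savings, they would need to be converted to numbers first. The sketch below is one way to do that with the standard library. `parse_price` is a hypothetical helper, and it assumes prices are US-dollar strings with an optional thousands separator.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a price string like '$20.66' to a Decimal (hypothetical helper)."""
    # drop all whitespace, then the currency symbol and thousands separators
    cleaned = "".join(text.split()).replace("$", "").replace(",", "")
    return Decimal(cleaned)


price_value = parse_price("$10")
strikethrough_value = parse_price("$20.66")
savings = strikethrough_value - price_value
print(savings)
```

`Decimal` is used here rather than `float` so that currency arithmetic does not pick up binary rounding noise.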


At this point I want to show you what my screen looks like:

![The process of webscraping](desktop-view.png)

Great! Now that we know how to scrape the information from the page, it is time to assemble a "spider" that can "crawl" through multiple search pages.

We've currently scraped 20 deals, but we know by visiting the search page that there are a lot more. We need some code to automatically go to the next page of search results, scrape the deals listings, and repeat. 

We need to find the URL for the next page and then repeat the scraping process.

![HTML for the next button](next-button.png)

Looking at the HTML structure I can see it is very easy to find the next button because it has the CSS class `next_page`.

In [20]:
next_button = search_results_page.find("a", "next_page")

print(next_button['href'])

/browse/cities/49/searches?city_name=Pittsburgh&city_search_id=49&country_search_id=1&page=2&query=&utf8=%E2%9C%93
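
Note that this `href` is a relative path, so it has to be combined with the site's address before it can be fetched. One way to do that is the standard library's `urljoin`, sketched here (the variable names are just for illustration):

```python
from urllib.parse import urljoin

# URL of the page we are currently on
current_url = "https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh"
# relative link taken from the next button
next_href = "/browse/cities/49/searches?city_name=Pittsburgh&city_search_id=49&country_search_id=1&page=2&query=&utf8=%E2%9C%93"

# urljoin keeps the scheme and host of current_url and swaps in the new path and query
next_url = urljoin(current_url, next_href)
print(next_url)
```

Plain string concatenation with the site's root URL gives the same result for this link; `urljoin` simply also copes with links that are already absolute.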


Sweet! This is all the information I need to build a spider/crawler/scraper to automate the process.

In the cells below we assemble the code from the exploratory analysis to automate the web scraping process. 
The first cell defines a function for extracting data from the HTML structure of a deal. After a quick test of that function on a single deal, a later cell uses it in a loop that crawls every page of search results.

In [4]:
def extract_deal_data(deal):
    """This function takes the raw deal HTML and 
    extracts eight data points into a Python dictionary.
    A data point that is missing from the HTML is stored as an empty string."""

    # errors raised when a tag or attribute is missing from the deal HTML
    missing = (AttributeError, KeyError, TypeError)

    data = {}
    try:
        data['id'] = deal['dealid']
    except missing:
        data['id'] = ""
    try:
        data['name'] = deal.find("div", "deal-details").h2.text
    except missing:
        data['name'] = ""
    try:
        data['seller'] = deal.find("div", "deal-details").h3.text
    except missing:
        data['seller'] = ""
    try:
        data['description'] = deal.find("p", "description").text
    except missing:
        data['description'] = ""
    try:
        data['location'] = deal.find("p", "location").text
    except missing:
        data['location'] = ""
    try:
        data['url'] = deal.find("a", "search-wrapper")['href']
    except missing:
        data['url'] = ""
    try:
        data['price'] = deal.find("div", "deal-price").text
    except missing:
        data['price'] = ""
    try:
        data['strikethrough-price'] = deal.find("div", "deal-strikethrough-price").text
    except missing:
        data['strikethrough-price'] = ""
    
    return data

In [22]:
extract_deal_data(deal)

{'description': 'The Deal\n\n  $10 for one general admission ticket (up to $20.66 value)\n\n\nXEB\n\n\n  The Band: XEB is made up of Kevin Cadogan and Arion Salazar, both f...',
 'id': '2006190',
 'location': 'Pittsburgh',
 'name': 'XEB plays "Third Eye Blind" on May 18 at 7:30 p.m.',
 'price': '$10',
 'seller': 'XEB',
 'strikethrough-price': '$20.66',
 'url': 'https://www.livingsocial.com/events/2006190-gl-xeb-third-eye-blind-hard-rock-cafe?pos=0'}
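
The `description` value still carries the runs of newlines and spaces from the page's HTML. If a single-line string is preferred, one simple option is to collapse the whitespace. `normalize_whitespace` below is a hypothetical helper, not something the notebook defines:

```python
def normalize_whitespace(text):
    """Collapse runs of spaces and newlines into single spaces (hypothetical helper)."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


raw = "The Deal\n\n  $10 for one general admission ticket (up to $20.66 value)\n\n\nXEB"
print(normalize_whitespace(raw))
```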

In [5]:
# set some needed variables 
base = "https://www.livingsocial.com"
url = "https://www.livingsocial.com/browse/cities/49/searches?utf8=%E2%9C%93&city_search_id=49&country_search_id=1&query=&city_name=Pittsburgh"

# create a global container
all_deals = []

# we are going to loop as long as this variable is true
crawl = True

print("Starting crawl.")

while crawl:
    
    # fetch the page, parse, and get the deals listing
    response = requests.get(url)
    search_results_page = BeautifulSoup(response.content, 'html.parser')
    raw_deals = search_results_page.find_all("li", "deal-tile")
    
    # save the results to a global container
    extracted_deals = [extract_deal_data(deal) for deal in raw_deals]
    all_deals.extend(extracted_deals)
    
    # print periodic crawl updates
    if len(all_deals) % 500 == 0:
        print("Collected %d results so far" % len(all_deals))
    
    # extract the Next button HTML element
    next_button = search_results_page.find("a", "next_page")
    
    # if there is no next button, or its CSS class contains disabled, then we've reached the end.
    if next_button is None or 'disabled' in next_button.get('class', []):
        print("Reached the end of the search results. Found %s deals." % len(all_deals))
        
        # set the crawl variable to false and break out of the while loop right away
        crawl = False
        break
    # set the next url to the contents of the next button
    url = base + next_button['href']

print("Crawl completed.")

Starting crawl.
Collected 500 results so far
Collected 1000 results so far
Collected 1500 results so far
Reached the end of the search results. Found 1995 deals.
Crawl completed.
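
The loop above requests pages back to back with no pause. As a rough sketch of a gentler variant, the function below waits between pages and stops after a maximum page count. The names `crawl_pages` and `scrape_page`, the one-second delay, and the 200-page cap are all assumptions for illustration. The page-fetching logic is passed in as a function so the loop itself can be tried without the network.

```python
import time


def crawl_pages(start_url, scrape_page, delay_seconds=1.0, max_pages=200):
    """Follow next-page links until there are none left.

    `scrape_page(url)` is a caller-supplied function (an assumption of this
    sketch) returning `(items, next_url)`, with `next_url` set to None on
    the last page.
    """
    collected = []
    url = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        items, url = scrape_page(url)
        collected.extend(items)
        pages_seen += 1
        if url is not None:
            # pause between requests so the crawl does not hammer the server
            time.sleep(delay_seconds)
    return collected
```

In this notebook, `scrape_page` would wrap the fetch, parse, and `extract_deal_data` steps and return the next button's URL, or `None` once the button is disabled.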


In [24]:
len(all_deals)

1988

In [25]:
# inspect the contents of one of the deals (the one at index 1000)
all_deals[1000]

{'description': 'Women’s Cross-Front Ruched Dress\n\n\nSoft, stretchy dress with crossed front\nSubtle ruching\nThree-quarter-length sleeves\nHemline hits around the knee...',
 'id': '176176',
 'location': '',
 'name': "Women's Cross-Front Ruched Dress",
 'price': '$18.99',
 'seller': '',
 'strikethrough-price': '$84',
 'url': 'https://www.livingsocial.com/products/us/tag/fashion/176176-women-s-cross-front-ruched-dress?pos=1000'}

In [26]:
from pandas import DataFrame

In [27]:
clean_data = DataFrame(all_deals)
clean_data.head()

|   | description | id | location | name | price | seller | strikethrough-price | url |
|---|---|---|---|---|---|---|---|---|
| 0 | The Deal\n\n $10 for one general admission ti... | 2006190 | Pittsburgh | XEB plays "Third Eye Blind" on May 18 at 7:30 ... | $10 | XEB | $20.66 | https://www.livingsocial.com/events/2006190-gl... |
| 1 | The Deal\n\n\n $10 for one general-admission ... | 2006176 | Pittsburgh | Garry Tallent of The E Street Band on April 30... | $10 | Garry Tallent of The E Street Band | $20.66 | https://www.livingsocial.com/events/2006176-gl... |
| 2 | About This Service Provider\r\nAll About Mass... | 1628360 | Pittsburgh | Swedish or Therapeutic Massage | $44.99 | All About Massage and Wellness | $75 | https://www.livingsocial.com/deals/1628360-swe... |
| 3 | Why You'll Love It\r\nLearn more about the Pit... | 1631636 | Pittsburgh | Self-Guided Scavenger Hunt in Pittsburgh | $22 | Big City Hunt | $40 | https://www.livingsocial.com/deals/1631636-sel... |
| 4 | With multiple stages on this illuminated 5K co... | 1636438 | Pittsburgh | Race Entry Package to Night Nation Run | $29.99 | Night Nation Run | $60 | https://www.livingsocial.com/events/1636438-ra... |


In [28]:
clean_data.to_csv("scraped-data.csv")
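
One thing to be aware of is that `to_csv` by default also writes the DataFrame's row index as an unnamed first column; passing `index=False` leaves it out. If pandas is not available, the same list of dictionaries can be written with the standard library instead. Here is a minimal sketch, using a made-up two-row stand-in for `all_deals` and an in-memory buffer in place of a real file:

```python
import csv
import io

# made-up rows standing in for all_deals
sample_deals = [
    {"id": "1", "name": "Example deal", "price": "$10"},
    {"id": "2", "name": "Another deal", "price": "$22"},
]

# swap the buffer for open("scraped-data.csv", "w", newline="") to write a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "price"])
writer.writeheader()
writer.writerows(sample_deals)

print(buffer.getvalue())
```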