# Web scraping

**Date: 28 March 2017**

@author: Daniel Csaba


## Preliminaries 

Import usual packages.  

In [1]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import datetime as dt           # date tools, used to note current date  

%matplotlib inline

We have seen how to input data from `csv` and `xls` files -- either online or from our computer -- and through APIs. Sometimes the data is only available as a specific part of a website.

We want to access the source code of the website and systematically extract the relevant information.

Again, use Google fu to find useful links. Here are a couple:
* [link 1](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [link 2](http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/)
* [link 3](https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/)

## Structure of web pages (very simplistic)

`Hypertext Markup Language` (HTML) specifies the structure and main content of the site -- it tells the browser how to lay out content. Think of `Markdown`.

It is structured using tags.

```html
<html>
    <head>
        (Meta) Information about the page.
    </head>
    <body>
        <p>
            This is a paragraph.
        </p>
        <table>
            This is a table
        </table>
    </body>
</html>
```

`Tag`s determine the content and layout depending on their relation to other tags. Useful terminology:

* `child` -- a child is a tag inside another tag. The `p` tag above is a child of the `body` tag.
* `parent` --  a parent is the tag another tag is inside. The `body` tag above is a parent of the `p` tag.
* `sibling` -- a sibling is a tag that is nested inside the same parent as another tag. The `head` and `body` tags above are siblings.

There are many different tags -- take a look at a [reference list](https://developer.mozilla.org/en-US/docs/Web/HTML/Element). You won't and shouldn't remember all of them but it's useful to have a rough idea about them.
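
The parent, child and sibling relationships described above can be checked programmatically. Here is a minimal sketch using the `BeautifulSoup` package (introduced later in these notes) on a trimmed copy of the toy page; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the toy page shown above
html = """
<html>
    <head></head>
    <body>
        <p>This is a paragraph.</p>
        <table></table>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

# parent: the tag that directly contains <p>
parent_name = soup.find('p').parent.name

# children: tags directly inside <body> (recursive=False skips deeper levels)
child_names = [tag.name for tag in soup.body.find_all(True, recursive=False)]

# siblings: tags that share the same parent, here the direct children of <html>
sibling_names = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(parent_name, child_names, sibling_names)
```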

Also take a look at a real example -- open any page, then right-click and choose "View Page Source".

In the real example you will see that there is more information after the tag name, most commonly a `class` and an `id`. Something similar to the following:

```html
<html>
    <head class='main-head'>
        (Meta) Information about the page.
    </head>
    <body>
        <p class='inner-paragraph' id='001'>
            This is a paragraph.
        </p>
        <table class='inner-table' id='002'>
            This is a table
        </table>
    </body>
</html>
```
The `class` and `id` information will help us in locating the information we are looking for in a systematic way.
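
As a rough illustration of how `class` and `id` help, the sketch below looks up tags by these attributes with `BeautifulSoup` (covered later in these notes). The snippet is a shortened version of the toy page above, not a real site.

```python
from bs4 import BeautifulSoup

# Shortened version of the toy page with class and id attributes
html = """
<body>
    <p class='inner-paragraph' id='001'>This is a paragraph.</p>
    <table class='inner-table' id='002'></table>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) is used because `class` is a Python keyword
by_class = soup.find('p', class_='inner-paragraph')

# an id should be unique on a page, so it can be searched without a tag name
by_id = soup.find(id='002')

print(by_class.get_text())
print(by_id.name, by_id['class'])
```

Note that `by_id['class']` comes back as a list, because a tag can carry several classes at once.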

A useful way to explore the `html` and the corresponding website is to right-click on the web page and then click on `Inspect element` -- this shows the browser's interpretation of the html.


Suppose we want to check prices for renting a room in Manhattan on Craigslist. Let's check, for example, the `rooms & shares` section for the [East Village](https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0).

## Accessing web pages 

We have to download the content of the webpage -- i.e. get the contents structured by the HTML. We can do this with the `requests` library, which is a human-readable HTTP (HyperText Transfer Protocol) library for Python. You can find the Quickstart Documentation [here](http://docs.python-requests.org/en/master/user/quickstart/).

In [2]:
import requests # you might have to install this

In [3]:
url = 'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'
cl = requests.get(url)

You might want to query for different things and download information for all of them. Instead of typing the query into the URL by hand, you can pass the search terms as a dictionary through the `params` argument.

In [4]:
url = 'https://newyork.craigslist.org/search/roo'
keys = {'query' : 'east village', 'availabilityMode' : '0'}
cl_extra = requests.get(url, params=keys)

In [5]:
# see if the URL was specified successfully
cl_extra.url

'https://newyork.craigslist.org/search/roo?availabilityMode=0&query=east+village'
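
To see what happens to the dictionary, here is a sketch of the same kind of encoding done with the standard library. This is only an illustration of the idea, not the code `requests` runs internally. The order of the parameters can differ from the output above, which does not matter to the server.

```python
from urllib.parse import urlencode

base_url = 'https://newyork.craigslist.org/search/roo'
keys = {'query': 'east village', 'availabilityMode': '0'}

# urlencode turns the dictionary into key=value pairs joined by '&'
# and replaces the space in 'east village' with '+'
query_string = urlencode(keys)
full_url = base_url + '?' + query_string
print(full_url)
```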

In [6]:
type(cl)

requests.models.Response

In [7]:
cl

<Response [200]>

The `[200]` stands for the `status_code`, which tells us whether the download was successful. If it starts with 2 it's a good sign, 4 or 5 not so much.
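
The first digit of the code gives its family. The sketch below uses a hypothetical helper, `status_family`, together with the standard library's `HTTPStatus` to spell this out for a few common codes.

```python
from http import HTTPStatus

def status_family(code):
    """Hypothetical helper: classify an HTTP status code by its first digit."""
    families = {1: 'informational', 2: 'success', 3: 'redirection',
                4: 'client error', 5: 'server error'}
    return families.get(code // 100, 'unknown')

for code in (200, 404, 503):
    # HTTPStatus(code).phrase is the standard short description of the code
    print(code, HTTPStatus(code).phrase, '->', status_family(code))
```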

In [8]:
cl.status_code

200

Check tab completion

In [9]:
cl.url

'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'

This is going to be ugly and unreadable

In [None]:
cl.text

In [None]:
cl.content # this works also for information which is not purely text

## Extracting information from a web page 

Now that we have the content of the web page we want to extract certain information. `BeautifulSoup` is a Python package which helps us in doing that. See the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more information.


In [10]:
from bs4 import BeautifulSoup

In [None]:
BeautifulSoup?

In [11]:
cl_soup = BeautifulSoup(cl.content, 'html.parser')

Print this out in a prettier way.

In [None]:
print(cl_soup.prettify())

In [12]:
print('Type:', type(cl_soup))

Type: <class 'bs4.BeautifulSoup'>


In [13]:
# we can access a tag 
print('Title: ', cl_soup.title)

Title:  <title>new york rooms for rent &amp; shares available "east village" - craigslist</title>


In [14]:
# or only the text content
print('Title: ', cl_soup.title.text) # or
print('Title: ', cl_soup.title.get_text())

Title:  new york rooms for rent & shares available "east village" - craigslist
Title:  new york rooms for rent & shares available "east village" - craigslist


We can find all tags of a certain type with the `find_all` method. This returns a list.

In [None]:
cl_soup.find_all?

To get the first paragraph in the html write

In [15]:
cl_soup.find_all('p')[0]

<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2017-03-30 11:28" title="Thu 30 Mar 11:28:04 AM">Mar 30</time>
<a class="result-title hdrlnk" data-id="6066575750" href="/mnh/roo/6066575750.html">** Free laundry! - MOVE IN to an East Village, PreWar Apt, etc -- PICS</a>
<span class="result-meta">
<span class="result-price">$1850</span>
<span class="housing">
                    950ft<sup>2</sup> -
                </span>
<span class="result-hood"> (East Village)</span>
<span class="result-tags">
                    pic
                    <span class="maptag" data-pid="6066575750">map</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
<a class="restore-link" href="#">
<span class="restore-narrow-text">r

This is a lot of information and we want to extract some part of it. Use the `text` attribute or the `get_text()` method to get the text content.

In [16]:
cl_soup.find_all('p')[0].get_text()

'\n\nfavorite this post\n\nMar 30\n** Free laundry! - MOVE IN to an East Village, PreWar Apt, etc -- PICS\n\n$1850\n\n                    950ft2 -\n                \n (East Village)\n\n                    pic\n                    map\n\n\nhide this posting\n\n\n\nrestore\nrestore this posting\n\n\n'

This is still messy. We will need a smarter search.
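
Part of the mess is just whitespace, which plain Python can collapse. The sketch below works on a shortened copy of the string above. Even after tidying, the fields still run together, which is why a more targeted search is needed.

```python
# Shortened copy of the text returned by get_text() above
messy = ('\n\nfavorite this post\n\nMar 30\n** Free laundry! - MOVE IN to an '
         'East Village, PreWar Apt, etc -- PICS\n\n$1850\n\n'
         '                    950ft2 -\n                \n (East Village)\n')

# split() with no argument cuts on any run of whitespace (spaces, newlines),
# and join() glues the pieces back together with single spaces
tidy = ' '.join(messy.split())
print(tidy)
```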

We can also access the `children` of a certain tag. For example here are the children of the first paragraph tag.

In [17]:
list(cl_soup.find_all('p')[0].children)

['\n', <span class="icon icon-star" role="button">
 <span class="screen-reader-text">favorite this post</span>
 </span>, '\n', <time class="result-date" datetime="2017-03-30 11:28" title="Thu 30 Mar 11:28:04 AM">Mar 30</time>, '\n', <a class="result-title hdrlnk" data-id="6066575750" href="/mnh/roo/6066575750.html">** Free laundry! - MOVE IN to an East Village, PreWar Apt, etc -- PICS</a>, '\n', <span class="result-meta">
 <span class="result-price">$1850</span>
 <span class="housing">
                     950ft<sup>2</sup> -
                 </span>
 <span class="result-hood"> (East Village)</span>
 <span class="result-tags">
                     pic
                     <span class="maptag" data-pid="6066575750">map</span>
 </span>
 <span class="banish icon icon-trash" role="button">
 <span class="screen-reader-text">hide this posting</span>
 </span>
 <span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
 <a class="restore-link" href="#">
 <span class="r

Look for tags based on their class. This is extremely useful for efficiently locating information.

In [18]:
cl_soup.find_all('span', class_='result-price')[0].get_text()

'$1850'

In [19]:
prices = cl_soup.find_all('span', class_='result-price')

In [20]:
price_data = [price.get_text() for price in prices]

In [21]:
price_data[:10]

['$1850',
 '$1850',
 '$1325',
 '$1325',
 '$1800',
 '$1800',
 '$1300',
 '$1300',
 '$1400',
 '$1400']

In [None]:
len(price_data)

We are getting more prices than we want -- there were only 120 listings on the page. Check the ads with "Inspect Element".
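
Judging from the repeated values in `price_data`, each ad seems to carry its price twice: once in the image link and once in the info paragraph. The sketch below reproduces this on a made-up two-ad snippet that mimics the Craigslist markup, and shows that searching inside each ad gives one price per listing.

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking two listings; each price appears twice per ad
html = """
<ul>
  <li class="result-row">
    <a class="result-image"><span class="result-price">$1850</span></a>
    <p class="result-info"><span class="result-price">$1850</span></p>
  </li>
  <li class="result-row">
    <a class="result-image"><span class="result-price">$1325</span></a>
    <p class="result-info"><span class="result-price">$1325</span></p>
  </li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# searching the whole page picks up every price span, duplicates included
all_prices = [span.get_text() for span in soup.find_all('span', class_='result-price')]

# searching within each ad and keeping the first match gives one price per ad
per_ad = [li.find('span', class_='result-price').get_text()
          for li in soup.find_all('li', class_='result-row')]

print(len(all_prices), len(per_ad))
```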

In [22]:
cl_soup.find_all('li', class_='result-row')[0]

<li class="result-row" data-pid="6066575750">
<a class="result-image gallery" data-ids="1:00T0T_k7bL8hZs1j1,1:00V0V_lRDosVbZz3w,1:00D0D_8kYoiRCvuq3,1:00707_fRAnNuWmeWM,1:00404_7NNtKdWfl53,1:00505_bBVMGdcDUZc,1:00v0v_lY5TwZAj4BJ,1:00H0H_1ZVbrc0310S,1:00k0k_j8mBgJNsd3S,1:00M0M_kMzynyz8wT,1:00W0W_3v5iPE4lQ44,1:00a0a_8l5AknuqV2G,1:00p0p_2DfCOT9TAsF,1:00B0B_3RBDtCuLNGE,1:00L0L_4ZNpbrv19qn" href="/mnh/roo/6066575750.html">
<span class="result-price">$1850</span>
</a>
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2017-03-30 11:28" title="Thu 30 Mar 11:28:04 AM">Mar 30</time>
<a class="result-title hdrlnk" data-id="6066575750" href="/mnh/roo/6066575750.html">** Free laundry! - MOVE IN to an East Village, PreWar Apt, etc -- PICS</a>
<span class="result-meta">
<span class="result-price">$1850</span>
<span class="housing">
                    950ft<sup>2</sup> -
           

In [23]:
ads = cl_soup.find_all('li', class_='result-row')

In [24]:
data = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
         ad.find('a', class_='result-title hdrlnk')['data-id'], 
         ad.find('span', class_='result-price').get_text()] for ad in ads ]

AttributeError: 'NoneType' object has no attribute 'get_text'

What's going wrong? Some ads don't have a price listed, so we can't retrieve it.

In [25]:
# if it exists then the type is
type(ads[0].find('span', class_='result-price'))

bs4.element.Tag

In [26]:
import bs4

data = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
         ad.find('a', class_='result-title hdrlnk')['data-id'], 
         ad.find('span', class_='result-price').get_text()] for ad in ads 
        if type(ad.find('span', class_='result-price'))==bs4.element.Tag]
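
The filter above drops ads without a price. An alternative sketch, assuming we would rather keep those ads, records `None` for the missing price instead. The snippet and the helper `text_or_none` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second ad has no price span
html = """
<ul>
  <li class="result-row">
    <a class="result-title hdrlnk" data-id="1">Room with price</a>
    <span class="result-price">$1850</span>
  </li>
  <li class="result-row">
    <a class="result-title hdrlnk" data-id="2">Room without price</a>
  </li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

def text_or_none(tag):
    """Hypothetical helper: find() returns None when nothing matches."""
    return tag.get_text() if tag is not None else None

rows = []
for ad in soup.find_all('li', class_='result-row'):
    title = ad.find('a', class_='result-title')
    price = ad.find('span', class_='result-price')
    rows.append([title.get_text(), title['data-id'], text_or_none(price)])

print(rows)
```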

In [27]:
df = pd.DataFrame(data)

In [28]:
df.head(10)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | ** Free laundry! - MOVE IN to an East Village,... | 6066575750 | $1850 |
| 1 | No Fee!Queen Sized Rm in East Village For $132... | 6063126853 | $1325 |
| 2 | Furnished Room in East Village Doorman Building | 6057888332 | $1800 |
| 3 | Beautiful sunny bedroom in great East Village/... | 6066076874 | $1300 |
| 4 | Queen room in 2BR, large living room, LES / Ea... | 6055449881 | $1400 |
| 5 | $1615 Bedroom available in East Village Bldg | 6065914838 | $1615 |
| 6 | DOORMAN BLDG Sunny Bright Room in East Village... | 6057762844 | $1295 |
| 7 | East Village Prime Location Downtown Living at... | 6065616266 | $1895 |
| 8 | EAST VILLAGE- Bedroom in Amazing 3bdrm w priva... | 6065488224 | $1415 |
| 9 | Great East Village Location, Facing Garden! | 6050790054 | $1100 |


In [29]:
df.shape

(118, 3)

We only have 118 listings because 2 listings did not have a price.

In [30]:
df.columns = ['Title', 'ID', 'Price']

In [31]:
df.head()

|   | Title | ID | Price |
|---|---|---|---|
| 0 | ** Free laundry! - MOVE IN to an East Village,... | 6066575750 | $1850 |
| 1 | No Fee!Queen Sized Rm in East Village For $132... | 6063126853 | $1325 |
| 2 | Furnished Room in East Village Doorman Building | 6057888332 | $1800 |
| 3 | Beautiful sunny bedroom in great East Village/... | 6066076874 | $1300 |
| 4 | Queen room in 2BR, large living room, LES / Ea... | 6055449881 | $1400 |


We could do text analysis and see what words are common in ads which have a relatively higher price.
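
A very rough sketch of such an analysis, using only the standard library: split the ads at the median price and count the words in the titles of the pricier half. The titles and prices below are made up for illustration; with the real data one would first turn the `Price` strings into numbers.

```python
import re
from collections import Counter
from statistics import median

# Made-up (title, price) pairs standing in for the scraped data
ads = [
    ('Furnished Room in East Village Doorman Building', 1800),
    ('Beautiful sunny bedroom in great East Village', 1300),
    ('Bedroom available in East Village Bldg', 1615),
    ('Great East Village Location, Facing Garden!', 1100),
]

cutoff = median(price for _, price in ads)

# count words in the titles of ads priced above the median
counts = Counter()
for title, price in ads:
    if price > cutoff:
        counts.update(re.findall(r"[a-z']+", title.lower()))

print(counts.most_common(5))
```

Words like "in" or "village" will dominate such a count, so a real analysis would also compare against the cheaper ads or drop very common words.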

This approach is not really efficient because it only gets the first page of the search results. We see at the top of the CL page the total number of listings. In the `Inspection` mode we can pick an element from the page and check how it is defined in the `html` -- this is useful to get tags and classes efficiently.

For example, the total number of ads is a `span` tag with a 'totalcount' `class`.

In [32]:
cl_soup.find('span', class_='totalcount')

<span class="totalcount">559</span>

If we start clicking through to the 2nd and 3rd pages of the results, we can see that there is a structure in how their URLs are defined.

First page:

https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0

Second page:

https://newyork.craigslist.org/search/roo?s=120&availabilityMode=0&query=east%20village

Third page:

https://newyork.craigslist.org/search/roo?s=240&availabilityMode=0&query=east%20village


The number after `roo?s=` in the URL specifies where the listings start from (not inclusive). In fact, if we modify it ourselves, the page starts from the corresponding listing and then shows 120 listings. Try it!

We can also define the first page by putting `s=0&` after `roo?` like this:

https://newyork.craigslist.org/search/roo?s=0&availabilityMode=0&query=east%20village


In [34]:
# First we get the total number of listings in real time
url = 'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'
cl = requests.get(url)
cl_soup = BeautifulSoup(cl.content, 'html.parser')
total_count = int(cl_soup.find('span', class_='totalcount').get_text())
print(total_count)

559


We have the total number of listings with the given search specification. Breaking down the steps:

1) Specify the url of each page we want to scrape

2) For each page scrape the data -- we will reuse the code that we already have for one page

3) Save the data into one dataframe -- we can use the `append` method for DataFrames (removed in pandas 2.0, where `pd.concat` takes its place) or the `extend` method for lists

In [35]:
# 1) Specify the url
for page in range(0, total_count, 120):
    print('https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page))


https://newyork.craigslist.org/search/roo?s=0&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=120&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=240&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=360&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=480&availabilityMode=0&query=east%20village
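
The same URLs can also be assembled from a dictionary of parameters instead of string formatting. This is a sketch using the standard library; `page_size` and `urls` are names introduced here, and `total_count` is hard-coded to the value printed above so the block runs without a download.

```python
from urllib.parse import urlencode, quote

base_url = 'https://newyork.craigslist.org/search/roo'
total_count = 559   # value printed above; changes over time on the live site
page_size = 120     # listings per page, as observed on the site

urls = []
for start in range(0, total_count, page_size):
    params = {'s': start, 'availabilityMode': 0, 'query': 'east village'}
    # quote_via=quote encodes the space as %20 rather than '+'
    urls.append(base_url + '?' + urlencode(params, quote_via=quote))

for u in urls:
    print(u)
```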


In [36]:
# Next we write a loop to scrape all pages

df = pd.DataFrame({'Title' : [], 'ID' : [], 'Price' : []})
for page in range(0, total_count, 120):
    url = 'https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page)
    cl = requests.get(url)
    cl_soup = BeautifulSoup(cl.content, 'html.parser')
    ads = cl_soup.find_all('li', class_='result-row')
    data = pd.DataFrame([[ad.find('a', class_='result-title hdrlnk').get_text(), 
         ad.find('a', class_='result-title hdrlnk')['data-id'], 
         ad.find('span', class_='result-price').get_text()] for ad in ads 
        if type(ad.find('span', class_='result-price'))==bs4.element.Tag], 
                       columns=['Title', 'ID', 'Price'])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, data], ignore_index=True)

In [37]:
df.head()

|   | ID | Price | Title |
|---|---|---|---|
| 0 | 6066575750 | $1850 | ** Free laundry! - MOVE IN to an East Village,... |
| 1 | 6063126853 | $1325 | No Fee!Queen Sized Rm in East Village For $132... |
| 2 | 6057888332 | $1800 | Furnished Room in East Village Doorman Building |
| 3 | 6066076874 | $1300 | Beautiful sunny bedroom in great East Village/... |
| 4 | 6055449881 | $1400 | Queen room in 2BR, large living room, LES / Ea... |


In [38]:
# Do the same using the `extend` method

data = []
for page in range(0, total_count, 120):
    url = 'https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page)
    cl = requests.get(url)
    cl_soup = BeautifulSoup(cl.content, 'html.parser')
    ads = cl_soup.find_all('li', class_='result-row')
    data_page = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
         ad.find('a', class_='result-title hdrlnk')['data-id'], 
         ad.find('span', class_='result-price').get_text()] for ad in ads 
        if type(ad.find('span', class_='result-price'))==bs4.element.Tag]
    data.extend(data_page)
    
df = pd.DataFrame(data, columns=['Title', 'ID', 'Price'])

In [39]:
df.head()

|   | Title | ID | Price |
|---|---|---|---|
| 0 | ** Free laundry! - MOVE IN to an East Village,... | 6066575750 | $1850 |
| 1 | No Fee!Queen Sized Rm in East Village For $132... | 6063126853 | $1325 |
| 2 | Furnished Room in East Village Doorman Building | 6057888332 | $1800 |
| 3 | Beautiful sunny bedroom in great East Village/... | 6066076874 | $1300 |
| 4 | Queen room in 2BR, large living room, LES / Ea... | 6055449881 | $1400 |


In [40]:
df.shape

(540, 3)

In [41]:
df.tail()

|   | Title | ID | Price |
|---|---|---|---|
| 535 | Spacious NYC Apartments, Huge Rooms, Utilities... | 6056402218 | $1400 |
| 536 | Open house 415.. Private room..... furnished a... | 6050402498 | $1199 |
| 537 | 2nd AVE GAY MALE LOFT FOR IMMEDIATELY | 6013067229 | $1599 |
| 538 | Large room in spacious 2-bdr, great Location i... | 6056318069 | $2500 |
| 539 | HUGE Bedrooms - Roommates needed for Immediate... | 5977845665 | $1400 |


We have scraped all the listings from CL in section "Rooms and Shares" for the East Village.

## Exercise

Suppose you have a couple of destinations in mind and you want to check the weather for each of them for this Friday. You want to get it from the [National Weather Service](http://www.weather.gov/).

These are the places I want to check (suppose there are many more and you want to automate it):

```python
locations = ['Roberta\'s, Brooklyn', 'White Sands National Park', 'Bronx Botanical Garden']
```

It seems that the NWS uses latitude and longitude coordinates in its search.

For example, for White Sands:
http://forecast.weather.gov/MapClick.php?lat=32.38092788700044&lon=-106.4794398029997

It would be cool to pass these on as arguments.

After some Google fu (i.e. "latitude and longitude of location python") find a post by [Chris Albon](https://chrisalbon.com/python/geocoding_and_reverse_geocoding.html) which describes exactly what we want.

Install `pygeocoder` through `pip install pygeocoder` (from `conda` only the OSX version is available).

In [43]:
from pygeocoder import Geocoder

In [44]:
# check for one of the locations how it's working
# some addresses might not be valid -- it goes through Google's API
loc = Geocoder.geocode('Roberta\'s, Brooklyn')
loc.coordinates

(40.7050766, -73.9335923)

We can check whether it's working fine at http://www.latlong.net/

In [45]:
locations = ['Roberta\'s, Brooklyn', 'White Sands National Park', 'Bronx Botanical Garden']

coordinates = [Geocoder.geocode(location).coordinates for location in locations]

In [46]:
for location, coordinate in zip(locations, coordinates):
    print('The coordinates of {} are:'.format(location), coordinate)

The coordinates of Roberta's, Brooklyn are: (40.7050766, -73.9335923)
The coordinates of White Sands National Park are: (32.7872403, -106.3256816)
The coordinates of Bronx Botanical Garden are: (40.862452, -73.8802382)


Define a dictionary for the parameters we want to pass to the GET request for the NWS server.

In [47]:
keys = {}
for location, coordinate in zip(locations, coordinates):
    keys[location] = {'lat' : coordinate[0], 'lon' : coordinate[1]}

In [48]:
keys

{'Bronx Botanical Garden': {'lat': 40.862452, 'lon': -73.8802382},
 "Roberta's, Brooklyn": {'lat': 40.7050766, 'lon': -73.9335923},
 'White Sands National Park': {'lat': 32.7872403, 'lon': -106.3256816}}

In [49]:
url = 'http://forecast.weather.gov/MapClick.php'
nws = requests.get(url, params=keys[locations[0]])

In [50]:
nws.url

'http://forecast.weather.gov/MapClick.php?lon=-73.9335923&lat=40.7050766'

In [None]:
nws.content[:100]

In [51]:
nws_soup = BeautifulSoup(nws.content, 'html.parser')

In [52]:
seven = nws_soup.find('div', id='seven-day-forecast-container')

In [53]:
seven.find(text='Friday').parent

<p class="period-name">Friday<br><br/></br></p>

In [54]:
seven.find(text='Friday').parent.parent

<div class="tombstone-container">
<p class="period-name">Friday<br><br/></br></p>
<p><img alt="Friday: Rain.  High near 43. Breezy, with an east wind 13 to 21 mph, with gusts as high as 31 mph.  Chance of precipitation is 100%. New precipitation amounts between a half and three quarters of an inch possible. " class="forecast-icon" src="newimages/medium/ra100.png" title="Friday: Rain.  High near 43. Breezy, with an east wind 13 to 21 mph, with gusts as high as 31 mph.  Chance of precipitation is 100%. New precipitation amounts between a half and three quarters of an inch possible. "/></p><p class="short-desc">Rain and<br>Breezy</br></p><p class="temp temp-high">High: 43 °F</p></div>
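
Chaining `.parent.parent` works but depends on exactly how deep the text sits. A sketch of an alternative, assuming the forecast boxes keep the `tombstone-container` class, is to climb to the enclosing box with `find_parent`. The snippet below is a simplified, made-up stand-in for the NWS page; the Thursday value is invented.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the forecast container (Thursday's value is made up)
html = """
<div id="seven-day-forecast-container">
  <div class="tombstone-container">
    <p class="period-name">Thursday</p>
    <p class="temp temp-high">High: 50 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="temp temp-high">High: 43 °F</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# locate the text 'Friday', then climb to the enclosing forecast box
label = soup.find(string='Friday')
box = label.find_parent('div', class_='tombstone-container')

# class_ matches any one of the tag's classes, so 'temp-high' is enough
high = box.find('p', class_='temp-high').get_text()
print(high)
```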

In [55]:
seven.find(text='Friday').parent.parent.find('p', class_='temp temp-high').get_text()

'High: 43 °F'
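
The result is still a string. To compare temperatures across locations it helps to pull out the number. Here is a minimal sketch with a regular expression; `parse_high` is a hypothetical helper, and it assumes the label always has the form shown above.

```python
import re

def parse_high(label):
    """Hypothetical helper: extract the temperature in °F as an integer,
    or return None if the label does not look like 'High: 43 °F'."""
    match = re.search(r'(-?\d+)\s*°F', label)
    return int(match.group(1)) if match else None

print(parse_high('High: 43 °F'))
```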

In [56]:
data = []
for location in locations:
    nws = requests.get(url, params=keys[location])
    nws_soup = BeautifulSoup(nws.content, 'html.parser')
    seven = nws_soup.find('div', id='seven-day-forecast-container')
    temp = seven.find(text='Friday').parent.parent.find('p', class_='temp temp-high').get_text()
    data.append([location, temp])

In [59]:
df_weather = pd.DataFrame(data, columns=['Location', 'Friday weather'])

In [60]:
df_weather

|   | Location | Friday weather |
|---|---|---|
| 0 | Roberta's, Brooklyn | High: 43 °F |
| 1 | White Sands National Park | High: 74 °F |
| 2 | Bronx Botanical Garden | High: 39 °F |
