## Week 2 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this week.

### Scraping permit data
Here's the code that we saw in the video lecture that queries the City of Seattle permit website, gets a dataframe of permits (including the URL), and then digs down further into that permit-specific URL.

In [1]:
# get the permit data from the API
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))

df = df.head(5) # get the first 5 rows, so we don't overload the city's website.

# get an example link
permiturl = df.loc[0,'link']['url']
print(permiturl)

# request that page and get the soup object
r = requests.get(permiturl)
soup = BeautifulSoup(r.text)
print(soup.prettify())

https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-US" ng-app="appAca" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head id="ctl00_Head1">
  <link href="../App_Themes/Default/_progressbar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/breadcrumb.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/Calendar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/custom.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/font.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/form.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/grid.css" rel="stylesheet" type="text/css"/>
  <link href="../App

In [2]:
# then we wrote this code to extract the project description 
links = soup.find_all('td')
for link in links:
    if 'Project Description' in link.text: 
        sublinks = link.find_all('td')
        description = sublinks[1].text
        # once we find a description, we exit
        break
    
print(description)

PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> If you look at the example, there is a <strong>Legal Description</strong> section. Extract that to a variable and print it.
</div>

Suggestions: this is a complex problem, so let's break it down step by step. This is my thought process - other ways may work too.

First, if we search (CTRL-F) for Legal Description in the `soup` above, we see that it's within some `tr` tags.  So let's `find_all` the content between each pair of `tr` tags, loop over it until we find the right one, and then look at that more closely. 

In [None]:
# find the first tr element that mentions the legal description
links = soup.find_all('tr')
for link in links:
    if 'Legal Description' in link.text: 
        # once we find a description, we exit
        break
    
print(link)

Looking at the output, it looks like the relevant text is within another `tr` tag. So let's do the same as before - just one level down.

In [None]:
sublinks = link.find_all('tr')
for sublink in sublinks:
    if 'Legal Description' in sublink.text: 
        # once we find a description, we exit
        break
    
print(sublink.text)

Got it! Now, which element of the `sublinks` list was it? Let's do trial and error.

In [None]:
print(sublinks[0].text)

Not that one. Let's try the next.

In [None]:
# and so on, until we get it
print(sublinks[3].text)
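
Rather than trying indexes one at a time, a loop with `enumerate` can print each element next to its position. The sketch below uses a hypothetical list of strings, `texts`, standing in for `[s.text for s in sublinks]`; its contents are invented for illustration, and `find_positions` is a made-up helper name.

```python
# Hypothetical stand-in for [s.text for s in sublinks]; the values are invented
texts = ['Header row', 'Project Description: ...', 'Owner: ...',
         'Legal Description: LOT 1 ...']

def find_positions(items, keyword):
    """Return the index of every item whose text contains keyword."""
    return [i for i, item in enumerate(items) if keyword in item]

for i, text in enumerate(texts):
    # show a short preview of each element alongside its index
    print(i, repr(text[:40]))

positions = find_positions(texts, 'Legal Description')
print(positions)
```

On the real page, several elements may match because outer `tr` tags contain the text of the inner ones, so it is still worth reading the previews before picking an index.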

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Now turn that into a function that you can apply to each row of your dataframe. Add a new column, <strong>legal_description</strong>, to your dataframe.
</div>

In [None]:
# I just copied and pasted the code above
# and indented it into a function
def get_legal(urldict):
    permiturl = urldict['url']

    # request that page and get the soup object
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text)
    links = soup.find_all('tr')
    for link in links:
        if 'Legal Description' in link.text: 
            sublinks = link.find_all('tr')
            description = sublinks[3].text
            # once we find a description, we exit
            return description

get_legal(df.loc[0,'link'])

Now we can apply it to the dataframe.

In [None]:
df['legal_description'] = df.link.apply(get_legal)
# check the results
df.head()
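
`apply` sends one request per row, so on a larger dataframe it is considerate to pause between requests. Here is a minimal sketch, assuming a hypothetical helper `polite_map` and a one-second default delay (neither is from the lecture). `fake_fetch` stands in for `get_legal` so that the example runs without network access.

```python
import time

def polite_map(func, items, delay=1.0):
    """Call func on each item in turn, sleeping `delay` seconds between calls."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(delay)  # pause before every call except the first
        results.append(func(item))
    return results

# Hypothetical stand-in for get_legal, so no network access is needed here
def fake_fetch(urldict):
    return 'legal description for ' + urldict['url']

rows = [{'url': 'https://example.com/a'}, {'url': 'https://example.com/b'}]
print(polite_map(fake_fetch, rows, delay=0.1))
```

With the real function, something like `df['legal_description'] = polite_map(get_legal, df.link)` should work, since pandas accepts a list with one entry per row.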

### Fixing errors
We'll do more scraping in just a moment. But first, let's do some examples of how to interpret an error message, and fix it.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Each of the cells below will generate an error. Look at the error message and see if you can figure out how to fix it. (Don't Google it until you try to figure it out based on the error message.)
</div>

In [None]:
# the housingunitsremoved and housingunitsadded give useful information
# let's create a new column with netunits
df['netunits'] = df.housingunitsadded - df.housingunitsremoved

In [None]:
# we need to convert them to a float first
df['netunits'] = df.housingunitsadded.astype(float) - df.housingunitsremoved.astype(float)
df['netunits']

In [None]:
# print the address of the first row
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1))

In [None]:
# We had two placeholders {} but only one variable to insert into them
# We could delete one of the {} or add a second argument to the format()
print('Address of first row is {}. Permit type is'.format(df.iloc[0].originaladdress1))
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1, df.iloc[0].permitclass))
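
The error here comes from plain Python string formatting, so it can be reproduced without the dataframe. The sketch below uses made-up values for the address and permit class. It also shows an f-string, an alternative where each value is written inside its own placeholder, so the counts cannot get out of step.

```python
address = '123 Example St'   # made-up value for illustration
permitclass = 'Residential'  # made-up value for illustration

template = 'Address of first row is {}. Permit type is {}'

try:
    # two placeholders, one argument: raises IndexError
    print(template.format(address))
except IndexError as err:
    print('IndexError:', err)

# matching the number of arguments to the number of placeholders works
message = template.format(address, permitclass)
print(message)

# an f-string puts each value inside its own placeholder
message_f = f'Address of first row is {address}. Permit type is {permitclass}'
print(message_f)
```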

In [None]:
# Convert the number of housing units to integers
# and then summarize

df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe(

In [None]:
# the first problem was our missing parenthesis

df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe()

In [None]:
# our second problem was the data type. An integer type cannot hold NaN
# so we do float
df['unitsadded_numeric'] = df.housingunitsadded.astype(float)
df.unitsadded_numeric.describe()
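
The same limitation shows up in plain Python: a float can be NaN ("not a number"), but an integer cannot. Here is a minimal sketch, assuming the raw values arrive as strings with some of them missing. `to_float` is a hypothetical helper, not part of pandas.

```python
import math

def to_float(value):
    """Convert a string to a float, using NaN for missing values."""
    if value is None:
        return math.nan
    return float(value)

values = ['2', None, '0']  # made-up data: the second value is missing
floats = [to_float(v) for v in values]
print(floats)

try:
    int(floats[1])  # NaN has no integer equivalent
except ValueError as err:
    print('ValueError:', err)
```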


### Scraping craigslist

In the lecture, we saw how to scrape the main page (the list of posts).

What if you want to get more information about (say) a particular apartment?

Here's the code from the lecture that gets a dataframe of the first 120 posts. Notice that there is a `url` column.

In [1]:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://losangeles.craigslist.org/search/lac/hhh'
r = requests.get(url)

soup = BeautifulSoup(r.content)
posts = soup.find_all('li', class_= 'result-row')

postList = []

for post in posts:
    result_price = post.find('span', class_= 'result-price')
    if result_price is None:
        price = None
    else:
        price = result_price.text
    
    resulthood = post.find('span', class_= 'result-hood')
    if resulthood is None:
        neighborhood = None
    else:
        neighborhood = resulthood.text 
        
    # we can also have our if..else statements as a one-liner
    # this is identical to the above
    neighborhood = None if resulthood is None else resulthood.text

    housing = post.find('span', class_= 'housing')
    housingsize = None if housing is None else housing.text
        
    # these two fields seem to be always present, so no need to check for None
    title = post.find('a', class_= 'result-title').text
    url = post.find('a', class_= 'result-title')['href']

    # now put them in the dictionary, and append to our list
    postList.append({'price': price, 'neighborhood':neighborhood, 
                     'housingsize':housingsize, 'title':title, 'url':url})

df = pd.DataFrame(postList)
df.head()

Unnamed: 0,price,neighborhood,housingsize,title,url
0,"$2,730",(central LA 213/323),\n 2br -\n ...,"Fitness CenterPhoto, Vaulted Ceilings, View",https://losangeles.craigslist.org/lac/apa/d/lo...
1,"$2,807","(1026 S Broadway, Los Angeles, CA)",\n 2br -\n ...,"Arched Skyline Viewing Windows, Central Air/He...",https://losangeles.craigslist.org/lac/apa/d/lo...
2,$200,(HOLLYWOOD),\n 880ft2 -\n,looking for a female roomate,https://losangeles.craigslist.org/lac/roo/d/lo...
3,"$2,447","(101 Bridewell St, Los Angeles, CA)",\n 2br -\n ...,"Abundant Closet Space, Fitness Center, Electri...",https://losangeles.craigslist.org/lac/apa/d/lo...
4,"$2,084","(616 St. Paul Ave., Los Angeles, CA)",\n 468ft2 -\n,"Closet Organizers, Automatic Dishwasher, Fax a...",https://losangeles.craigslist.org/lac/apa/d/lo...
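
The scraped `price` and `housingsize` values are strings, so they need cleaning before any numeric analysis. Here is a minimal sketch using regular expressions, assuming prices are whole-dollar amounts like `$2,730` and the housing text contains fragments like `2br` or `880ft2`, as in the output above. `parse_price` and `parse_housing` are hypothetical helper names, and the second test string is made up.

```python
import re

def parse_price(price):
    """Turn a string like '$2,730' into the integer 2730; None stays None."""
    if price is None:
        return None
    digits = re.sub(r'[^0-9]', '', price)  # drop '$', ',' and anything else
    return int(digits) if digits else None

def parse_housing(text):
    """Return (bedrooms, square_feet); each part is None if it is not found."""
    if text is None:
        return (None, None)
    bedrooms = re.search(r'(\d+)br', text)
    sqft = re.search(r'(\d+)ft2', text)
    return (int(bedrooms.group(1)) if bedrooms else None,
            int(sqft.group(1)) if sqft else None)

print(parse_price('$2,730'))
print(parse_housing('\n 2br -\n 880ft2 -\n'))
```

These could then be applied column by column, for example `df['price_num'] = df.price.apply(parse_price)`.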


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> For the first url in your dataframe, use requests to get the content of the post. (No need to create a soup object yet.)
</div>

In [2]:
# your code here
# put the output of the request in a variable called r
# so you can access the content like this
url = df.loc[0, 'url']
r = requests.get(url)
print(r.content)

b'<!DOCTYPE html>\n<html>\n<head>\n    \n\t<meta charset="UTF-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta name="viewport" content="width=device-width,initial-scale=1">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:title" content="Be the first to live in this newly renovated 1BD/1BA- Ready Now! -...">\n\t<meta name="description" content="************************************************************** ************************************************************** Essex Properties at Miracle Mile 400 S. Detroit . | Los Angeles, CA...">\n\t<meta property="og:description" content="************************************************************** ************************************************************** Essex Properties at Miracle Mile 400 S. Detroit . | Los Angeles, CA...">\n\t<meta property="og:image" content="https://images.craigslist.org/00000_f0JdezzBw0Bz_0t20CI_600x450.jpg">\n\t<me

Now let's extract more information from the page. We have a couple of strategies here. First, we could skip trying to parse the page with `BeautifulSoup`, and just see if particular bits of text are present.

For example, what transportation modes does the post emphasize? Do they mention Section 8 vouchers? Some of this might be exploratory—we can see what type of language is included, and then parse in a more structured way (e.g. distinguishing between "No Section 8" and "Section 8 welcome").

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Write a function that will return True if Section 8 is mentioned, otherwise False.
</div>

*Hint*: the `in` operator is a simple way to do this. For example:

In [3]:
'plan' in 'urban planning'

True

In [4]:
'plan' in 'Urban Planning' 

False

In [8]:
# we can use the same approach to see 
# if a string is in the text that we retrieved via requests
# note the use of lower() to avoid case sensitivity
'section 8' in r.text.lower()

False

In [10]:
# but this will return True if the text is in the string
'los angeles' in r.text.lower()

True

In [2]:
# so let's put this in a function

def sect8(url):
    r = requests.get(url)
    return 'section 8' in r.text.lower()

# test it
sect8(url)

False

In [3]:
# now apply to the whole dataframe
# we pass the URL for each posting (the url column) to apply, 
# and call our sect8 function on that url
# sect8 returns True or False, and we store that in the new column, section8

df['section8'] = df.url.apply(sect8)

In [5]:
# most seem False
df.head()

Unnamed: 0,price,neighborhood,housingsize,title,url,section8
0,"$2,730",(central LA 213/323),\n 2br -\n ...,"Fitness CenterPhoto, Vaulted Ceilings, View",https://losangeles.craigslist.org/lac/apa/d/lo...,False
1,"$2,807","(1026 S Broadway, Los Angeles, CA)",\n 2br -\n ...,"Arched Skyline Viewing Windows, Central Air/He...",https://losangeles.craigslist.org/lac/apa/d/lo...,False
2,$200,(HOLLYWOOD),\n 880ft2 -\n,looking for a female roomate,https://losangeles.craigslist.org/lac/roo/d/lo...,False
3,"$2,447","(101 Bridewell St, Los Angeles, CA)",\n 2br -\n ...,"Abundant Closet Space, Fitness Center, Electri...",https://losangeles.craigslist.org/lac/apa/d/lo...,False
4,"$2,084","(616 St. Paul Ave., Los Angeles, CA)",\n 468ft2 -\n,"Closet Organizers, Automatic Dishwasher, Fax a...",https://losangeles.craigslist.org/lac/apa/d/lo...,False


In [7]:
# but in my version, I got at least one posting that was True
df.section8.mean() # the fraction that are True

0.008333333333333333

In [8]:
# get the rows of the dataframe where section8 is True
df[df.section8]

Unnamed: 0,price,neighborhood,housingsize,title,url,section8
16,"$2,495",(LOS FELIZ AREA),\n 2br -\n ...,HOLLYWOOD(love pets) GARDENS ($500 DEPOSIT)*AC...,https://losangeles.craigslist.org/lac/apa/d/lo...,True
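
Once the exploratory search shows that some posts mention Section 8, a regular expression can start to separate refusals from other mentions, such as "No Section 8" versus "Section 8 welcome". This is a rough sketch: `classify_section8` and the phrase list are assumptions for illustration, and real posts will use wordings that these patterns miss.

```python
import re

# Assumed phrasings; extend these after reading real posts
NEGATIVE = re.compile(r"\b(no|not accepting|do not accept|don't accept)\s+section\s*8\b")
MENTION = re.compile(r"\bsection\s*8\b")

def classify_section8(text):
    """Return 'refused', 'mentioned', or 'absent' for a post's text."""
    text = text.lower()
    if NEGATIVE.search(text):
        return 'refused'
    if MENTION.search(text):
        return 'mentioned'
    return 'absent'

print(classify_section8('Sorry, no Section 8.'))
print(classify_section8('Section 8 welcome!'))
print(classify_section8('Two bedroom loft downtown'))
```

To use it on the scraped posts, the function would take `r.text` in place of the example strings.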


Most of the post is free-form text. So there's not going to be much value added by `BeautifulSoup`.

The exceptions are (i) parking, and (ii) the geographic coordinates.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Write a function that returns True if the apartment has no parking, and that also returns the lat/lon of the apartment.
</div>

*Hint*: First, create a `soup` object. Then, look and see what tag and class encloses this information. Then, you can experiment with `find` and `find_all` with this tag and class.

In [25]:
# first, let's look at a single row (here, the row with index 5)
url = df.loc[5, 'url']
# print the url so that you can click on it and look at it in your browser
print(url)

# get a soup object
r = requests.get(url)
soup = BeautifulSoup(r.content)
print(soup.prettify())

https://losangeles.craigslist.org/lac/apa/d/los-angeles-go-by-fast-wonderful-two/7469983603.html
<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="Go by fast a wonderful two bedroom loft! Apply today! - apts/housing..." property="og:title"/>
  <meta content="Experience urban-style elegance and European living in the heart of downtown Los Angeles. The Da Vinci, the newest member of the Renaissance collection, offers fifteen unique apartment floor plans..." name="description"/>
  <meta content="Experience urban-style elegance and European living in the heart of downtown Los Angeles. The Da Vinci, the newest member of the Renaissance collection, offers fifteen unique apartment floor plans..." property="og:description"/>
  <meta content="h

In [26]:
# in my example, the webpage said "carport" in the right hand panel
# I did a CTRL-F, and this was the result

#<p class="attrgroup">
#       <span>
#        apartment
#       </span>
#       <br/>
#       <span>
#        w/d in unit
#       </span>
#       <br/>
#       <span>
#        carport
#       </span>
#       <br/>
#      </p>

# so it looks like we want the tag p and the class attrgroup
links = soup.find_all('p', class_='attrgroup')
links

[<p class="attrgroup">
 <span class="shared-line-bubble"><b>2BR</b> / <b>2Ba</b></span>
 <span class="shared-line-bubble"><b>1246</b>ft<sup>2</sup></span>
 </p>,
 <p class="attrgroup">
 <span>apartment</span>
 <br/>
 <span>w/d in unit</span>
 <br/>
 <span>carport</span>
 <br/>
 </p>]

In [28]:
# I see that the result was a list of length 2, 
# and what we want is in the second element
links[1]

<p class="attrgroup">
<span>apartment</span>
<br/>
<span>w/d in unit</span>
<br/>
<span>carport</span>
<br/>
</p>

In [29]:
# then the next level down, it looks like it's within the span tag
# and the 3rd element of the list
links[1].find_all('span')[2].text

# I get "carport"

# A simpler way would be to search for "no parking" in r.text! 

'carport'
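
With the attribute text in hand, the parking part of the exercise reduces to checking a list of strings. Here is a minimal sketch, assuming the attribute spans have already been collected into a list (for example with `[s.text for s in links[1].find_all('span')]`) and that a post without parking is labelled `no parking`. `has_no_parking` is a hypothetical name.

```python
def has_no_parking(attributes):
    """Return True if any attribute is exactly 'no parking' (ignoring case and spaces)."""
    cleaned = [a.strip().lower() for a in attributes]
    return 'no parking' in cleaned

# Attribute text copied from the example post above
example = ['apartment', 'w/d in unit', 'carport']
print(has_no_parking(example))

# Made-up example of a post that advertises no parking
print(has_no_parking(['apartment', ' No Parking ']))
```

Note that False here means only that the label was not found. A post that says nothing about parking also returns False.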

In [31]:
# what about lat and lon?
# I found this in <div id="map" class="viewposting"
links = soup.find_all('div', class_='viewposting')
links


[<div class="viewposting" data-accuracy="5" data-latitude="34.061400" data-longitude="-118.238500" id="map"></div>]

In [32]:
# it's in a list of length 1
# we could get this via link = links[0]
# or use find (which gets the first instance) rather than find_all
link = soup.find('div', class_='viewposting')
link

<div class="viewposting" data-accuracy="5" data-latitude="34.061400" data-longitude="-118.238500" id="map"></div>

In [33]:
# This functions like a dictionary object!
lat = link['data-latitude']
lon = link['data-longitude']
lat, lon


('34.061400', '-118.238500')

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Apply this function to your dataframe, and create new columns for parking, lat, and lon.
</div>

In [34]:
# let's put this together
# just pasting the code we wrote above into a function
def get_latlong(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = soup.find('div', class_='viewposting')
    lat = link['data-latitude']
    lon = link['data-longitude']
    return (lat, lon)

# test it 
get_latlong(url)

('34.061400', '-118.238500')

In [36]:
# and apply!
# I got an error with one post that was missing this info
# so I added a try: except block to the function
# it returns None if something goes wrong

def get_latlong(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = soup.find('div', class_='viewposting')
    try:
        lat = link['data-latitude']
        lon = link['data-longitude']
        return (lat, lon)
    # link is None (TypeError) or the attribute is missing (KeyError)
    except (TypeError, KeyError):
        return None

df['latlon'] = df.url.apply(get_latlong)


In [37]:
df.head()

# we'd then do the same for parking

Unnamed: 0,price,neighborhood,housingsize,title,url,section8,latlon
0,"$2,730",(central LA 213/323),\n 2br -\n ...,"Fitness CenterPhoto, Vaulted Ceilings, View",https://losangeles.craigslist.org/lac/apa/d/lo...,False,"(34.094823, -118.340510)"
1,"$2,807","(1026 S Broadway, Los Angeles, CA)",\n 2br -\n ...,"Arched Skyline Viewing Windows, Central Air/He...",https://losangeles.craigslist.org/lac/apa/d/lo...,False,"(34.040567, -118.257731)"
2,$200,(HOLLYWOOD),\n 880ft2 -\n,looking for a female roomate,https://losangeles.craigslist.org/lac/roo/d/lo...,False,"(34.069900, -118.349200)"
3,"$2,447","(101 Bridewell St, Los Angeles, CA)",\n 2br -\n ...,"Abundant Closet Space, Fitness Center, Electri...",https://losangeles.craigslist.org/lac/apa/d/lo...,False,"(34.114500, -118.192900)"
4,"$2,084","(616 St. Paul Ave., Los Angeles, CA)",\n 468ft2 -\n,"Closet Organizers, Automatic Dishwasher, Fax a...",https://losangeles.craigslist.org/lac/apa/d/lo...,False,"(34.055900, -118.266600)"
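
The exercise asks for separate `lat` and `lon` columns, and the scraped values are strings rather than numbers. Here is a minimal sketch of the conversion, assuming each entry is either a `(lat, lon)` tuple of strings or `None`, as returned by `get_latlong`. `split_latlon` is a hypothetical helper.

```python
def split_latlon(latlon):
    """Convert a ('lat', 'lon') tuple of strings to floats; None becomes (None, None)."""
    if latlon is None:
        return (None, None)
    lat, lon = latlon
    return (float(lat), float(lon))

# One pair copied from the example post above, plus a missing entry
examples = [('34.061400', '-118.238500'), None]
converted = [split_latlon(item) for item in examples]
print(converted)
```

In the dataframe, one option is `df['lat'] = df.latlon.apply(lambda x: split_latlon(x)[0])`, and likewise with `[1]` for `lon`.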


<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain confidence in experimenting with code - exploring different objects, writing functions, and so on.</li>
  <li>Learn how to extract information from a scraped webpage - how to do the detective work.</li>
  <li>Gain confidence in debugging errors.</li>
</ul>
</div>