## Week 2 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this week.

### Scraping permit data
Here's the code that we saw in the video lecture that queries the City of Seattle permit website, gets a dataframe of permits (including the URL), and then digs down further into that permit-specific URL.

In [1]:
# get the permit data from the API
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))

df = df.head(5) # get the first 5 rows, so we don't overload the city's website.

# get an example link
permiturl = df.loc[0,'link']['url']
print(permiturl)

# request that page and get the soup object
r = requests.get(permiturl)
soup = BeautifulSoup(r.text)
print(soup.prettify())

https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-US" ng-app="appAca" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head id="ctl00_Head1">
  <link href="../App_Themes/Default/_progressbar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/breadcrumb.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/Calendar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/custom.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/font.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/form.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/grid.css" rel="stylesheet" type="text/css"/>
  <link href="../App

In [2]:
# then we wrote this code to extract the project description 
links = soup.find_all('td')
for link in links:
    if 'Project Description' in link.text: 
        sublinks = link.find_all('td')
        description = sublinks[1].text
        # once we find a description, we exit
        break
    
print(description)

PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> If you look at the example, there is a <strong>Legal Description</strong> section. Extract that to a variable and print it.
</div>

Suggestions: this is a complex problem, so let's break it down step by step. This is my thought process - other ways may work too.

First, if we search (CTRL-F) for Legal Description in the `soup` above, we see that it's within some `tr` tags.  So let's `find_all` the content between each pair of `tr` tags, loop over it until we find the right one, and then look at that more closely. 

In [3]:
# loop over the rows until we find the one that contains the legal description
links = soup.find_all('tr')
for link in links:
    if 'Legal Description' in link.text: 
        # once we find a description, we exit
        break
    
print(link)

<tr id="TRMoreDetail" style="display: none;">
<td class="moredetail_td">
                 </td>
<td colspan="2" style="height: 24px;">
<div style="text-align: center;">
<span id="ctl00_PlaceHolderMain_PermitDetailList1_tbASIList">
<table border="0" cellpadding="0" cellspacing="0" role="presentation" style="width: 98%">
<tr>
<td class="MoreDetail_BlockTitle">
<h1>
<a class="NotShowLoading" href="javascript:void(0);" id="lnkASI" onclick='ControlDisplay($get("trASIList"),$get("imgASI"),true,$get("lnkASI"),$get("ctl00_PlaceHolderMain_PermitDetailList1_lblASIList"))' title="Expand Application Information">
<img alt="Expand" id="imgASI" src="/Portal/app_themes/Default/assets/plus_expand.gif" style="cursor: pointer; border-width:0px;"/></a> <span id="ctl00_PlaceHolderMain_PermitDetailList1_lblASIList">Application Information</span></h1>
</td>
</tr>
<tr id="trASIList" style="display: none;">
<td class="MoreDetail_BlockContent">
<div id="ctl00_PlaceHolderMain_PermitDetailList1_phPlumbingGroup"

Looking at the output, it looks like the relevant text is within another `tr` tag. So let's do the same as before - just one level down.

In [4]:
sublinks = link.find_all('tr')
for sublink in sublinks:
    if 'Legal Description' in sublink.text: 
        # once we find a description, we exit
        break
    
print(sublink.text)




Development Site Parcel:DV0001932Legal Description:PAR A, LBA #8806752, THT POR BLK 13, KINNEARS 1ST RAINIER BEACH ADD, AND POR TRC A, BLK 2, GUTHERIES TERRACE PARK, AND POR SE 1/4 (FILE)





Got it! Now, which element of the `sublinks` list was it? Let's do trial and error.

In [5]:
print(sublinks[0].text)





 Application Information




Not that one. Let's try the next.

In [6]:
# and so on, until we get it
print(sublinks[3].text)




Development Site Parcel:DV0001932Legal Description:PAR A, LBA #8806752, THT POR BLK 13, KINNEARS 1ST RAINIER BEACH ADD, AND POR TRC A, BLK 2, GUTHERIES TERRACE PARK, AND POR SE 1/4 (FILE)
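
As an alternative to trial and error, the position can be found programmatically with `enumerate`. This is only a minimal sketch: the `html` string is a made-up, much simpler stand-in for the real permit page, and `soup_demo`, `rows` and `matches` are names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the permit page (not the real HTML)
html = """
<table>
  <tr><td>Application Information</td></tr>
  <tr><td>Project Description: short plat</td></tr>
  <tr><td>Legal Description: PAR A</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html, 'html.parser')
rows = soup_demo.find_all('tr')

# enumerate gives us the position alongside each row
matches = [i for i, row in enumerate(rows) if 'Legal Description' in row.text]
print(matches)
```

On a page with nested tables, outer rows also match because their text includes the text of the rows inside them, so more than one index may come back.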





<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Now turn that into a function that you can apply to each row of your dataframe. Add a new column, <strong>legal_description</strong>, to your dataframe.
</div>

In [7]:
# I just copied and pasted the code above
# and indented it into a function
def get_legal(urldict):
    permiturl = urldict['url']

    # request that page and get the soup object
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text)
    links = soup.find_all('tr')
    for link in links:
        if 'Legal Description' in link.text: 
            sublinks = link.find_all('tr')
            description = sublinks[3].text
            # once we find a description, we exit
            return description

get_legal(df.loc[0,'link'])

'\n\n\nDevelopment Site Parcel:DV0001932Legal Description:PAR A, LBA #8806752, THT POR BLK 13, KINNEARS 1ST RAINIER BEACH ADD, AND POR TRC A, BLK 2, GUTHERIES TERRACE PARK, AND POR SE 1/4 (FILE)\n\n\n'
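
The returned string still carries leading and trailing newlines and the `Development Site Parcel` prefix. One possible way to tidy it, sketched here with a hypothetical helper called `clean_legal` (not part of the lecture code), is to split on the label and strip the whitespace:

```python
# The raw string returned by get_legal, copied from the output above
raw = '\n\n\nDevelopment Site Parcel:DV0001932Legal Description:PAR A, LBA #8806752, THT POR BLK 13, KINNEARS 1ST RAINIER BEACH ADD, AND POR TRC A, BLK 2, GUTHERIES TERRACE PARK, AND POR SE 1/4 (FILE)\n\n\n'

def clean_legal(text):
    """Hypothetical helper: return only the text after 'Legal Description:'."""
    if text is None:
        return None
    # partition splits on the first occurrence of the label
    before, label, after = text.partition('Legal Description:')
    # if the label is missing, fall back to the stripped original
    return after.strip() if label else text.strip()

legal = clean_legal(raw)
print(legal)
```

The `None` check matters because `get_legal` returns `None` when no row on the page mentions a legal description.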

Now we can apply it to the dataframe.

In [None]:
df['legal_description'] = df.link.apply(get_legal)
# check the results
df.head()

### Fixing errors
We'll do more scraping in just a moment. But first, let's do some examples of how to interpret an error message, and fix it.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Each of the cells below will generate an error. Look at the error message and see if you can figure out how to fix it. (Don't Google it until you try to figure it out based on the error message.)
</div>

In [None]:
# the housingunitsremoved and housingunitsadded give useful information
# let's create a new column with netunits
df['netunits'] = df.housingunitsadded - df.housingunitsremoved

In [None]:
# we need to convert them to a float first
df['netunits'] = df.housingunitsadded.astype(float) - df.housingunitsremoved.astype(float)
df['netunits']
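
If some values are missing or cannot be read as numbers, `pd.to_numeric` with `errors='coerce'` is a more forgiving alternative to `astype(float)`. The sketch below uses made-up toy data in a dataframe called `demo`, and it assumes that a missing count can be treated as zero, which may not suit every analysis.

```python
import pandas as pd

# Toy data (made up): numbers stored as strings, with some values missing
demo = pd.DataFrame({'housingunitsadded': ['2', '0', None],
                     'housingunitsremoved': ['1', None, '0']})

# errors='coerce' turns anything unparseable into NaN instead of raising
added = pd.to_numeric(demo.housingunitsadded, errors='coerce')
removed = pd.to_numeric(demo.housingunitsremoved, errors='coerce')

# treat a missing count as zero (an assumption for this sketch)
demo['netunits'] = added.fillna(0) - removed.fillna(0)
print(demo)
```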

In [None]:
# print the address of the first row
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1))

In [None]:
# We had two placeholders {} but only one variable to insert into them
# We could delete one of the {} or add a second argument to the format()
print('Address of first row is {}. Permit type is'.format(df.iloc[0].originaladdress1))
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1, df.iloc[0].permitclass))
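
An f-string is another way to avoid this kind of mismatch, because each variable sits inside its own placeholder. A small sketch, with made-up values standing in for `df.iloc[0].originaladdress1` and `df.iloc[0].permitclass`:

```python
# Made-up values standing in for the two dataframe fields
address = '123 Example St'
permit_class = 'Residential'

# each placeholder names the variable it displays, so none can be left empty
message = f'Address of first row is {address}. Permit type is {permit_class}'
print(message)
```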

In [None]:
# Convert the number of housing units to integers
# and then summarize

df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe(

In [None]:
# the first problem was our missing parenthesis

df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe()

In [None]:
# our second problem was the data type. An integer type cannot hold NaN
# so we do float
df['unitsadded_numeric'] = df.housingunitsadded.astype(float)
df.unitsadded_numeric.describe()
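
To see the data type problem in isolation, here is a sketch on a made-up toy column called `units`. It also shows pandas' nullable `Int64` type, which is one option (not used in the lecture) for keeping whole numbers alongside missing values.

```python
import numpy as np
import pandas as pd

# Toy column (made up) with one missing value
units = pd.Series(['3', np.nan, '12'])

try:
    units.astype(int)
    int_failed = False
except (ValueError, TypeError) as e:
    # NaN has no integer representation, so the conversion fails
    int_failed = True
    print('astype(int) failed:', e)

as_float = units.astype(float)           # NaN is a valid float
as_nullable = as_float.astype('Int64')   # nullable integer keeps the gap as <NA>
print(as_float.tolist())
print(as_nullable.tolist())
```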


### Scraping craigslist

In the lecture, we saw how to scrape the main page (the list of posts).

What if you want to get more information about (say) a particular apartment?

Here's the code from the lecture that gets a dataframe of the first 120 posts. Notice that there is a `url` column.

In [8]:
url = 'https://losangeles.craigslist.org/search/lac/hhh'
r = requests.get(url)

soup = BeautifulSoup(r.content)
posts = soup.find_all('li', class_= 'result-row')

postList = []

for post in posts:
    result_price = post.find('span', class_= 'result-price')
    if result_price is None:
        price = None
    else:
        price = result_price.text
    
    resulthood = post.find('span', class_= 'result-hood')
    if resulthood is None:
        neighborhood = None
    else:
        neighborhood = resulthood.text 
        
    # we can also have our if..else statements as a one-liner
    # this is identical to the above
    neighborhood = None if resulthood is None else resulthood.text

    housing = post.find('span', class_= 'housing')
    housingsize = None if housing is None else housing.text
        
    # these two fields seem to be always present, so no need to check for None
    title = post.find('a', class_= 'result-title').text
    url = post.find('a', class_= 'result-title')['href']

    # now put them in the dictionary, and append to our list
    postList.append({'price': price, 'neighborhood':neighborhood, 
                     'housingsize':housingsize, 'title':title, 'url':url})

df = pd.DataFrame(postList)
df.head()

Unnamed: 0,price,neighborhood,housingsize,title,url
0,"$2,558","(930 Figuroa Terrace, Los Angeles, CA)",\n 2br -\n ...,"Expansive Sundecks, Private Balconies/Patios, ...",https://losangeles.craigslist.org/lac/apa/d/lo...
1,"$1,650",(HOLLYWOOD/W),\n 525ft2 -\n,"***YOUR HW DREAM STUDIO, 4 LESS***",https://losangeles.craigslist.org/lac/apa/d/lo...
2,"$3,500",(LOS ANGELES / LOS FELIZ),\n 3br -\n,*****************FULLY RENOVATED TOWNHOUSE BAC...,https://losangeles.craigslist.org/lac/apa/d/lo...
3,"$3,270",(central LA 213/323),\n 1br -\n ...,Luxury Has It's Rewards,https://losangeles.craigslist.org/lac/apa/d/lo...
4,"$2,861",(Hollywood),\n 932ft2 -\n,"Stainless Steel Appliances, Resort pool and de...",https://losangeles.craigslist.org/lac/apa/d/lo...
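
The repeated `None` checks in the loop above could be folded into a small helper. This is a sketch only: `text_or_none` is a hypothetical name, and the `html` string is a made-up imitation of a single result row rather than real Craigslist output.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, tag, class_name):
    """Hypothetical helper: text of the first matching tag, or None if absent."""
    found = parent.find(tag, class_=class_name)
    return None if found is None else found.text.strip()

# Made-up snippet imitating one result row that has a price but no neighborhood
html = '<li class="result-row"><span class="result-price">$1,650</span></li>'
post = BeautifulSoup(html, 'html.parser').find('li', class_='result-row')

price = text_or_none(post, 'span', 'result-price')
neighborhood = text_or_none(post, 'span', 'result-hood')
print(price, neighborhood)
```

Calling `.strip()` in the helper would also remove the stray newlines and spaces visible in the `housingsize` column.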


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> For the first url in your dataframe, use requests to get the content of the post. (No need to create a soup object yet.)
</div>

In [9]:
# your code here
# put the output of the request in a variable called r
# so you can access the content like this
print(r.content)

b'<!DOCTYPE html>\n<html>\n<head>\n    \n\t<meta charset="UTF-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta name="viewport" content="width=device-width,initial-scale=1">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:title" content="central LA housing - craigslist">\n\t<meta name="description" content="central LA housing - craigslist">\n\t<meta property="og:description" content="central LA housing - craigslist">\n\t<meta property="og:url" content="https://losangeles.craigslist.org/search/lac/hhh">\n\t<meta name="smartbanner:api" content="true">\n\t<meta name="smartbanner:title" content="the craigslist app">\n\t<meta name="smartbanner:author" content="what&#39;s old is new">\n\t<meta name="smartbanner:icon-apple" content="/images/app_icon.png">\n\t<meta name="smartbanner:icon-google" content="/images/app_icon.png">\n\t<meta name="smartbanner:button" content="view">\n\t<meta name="smartb

Now let's extract more information from the page. We have a couple of strategies here. First, we could skip trying to parse the page with `BeautifulSoup`, and just see if particular bits of text are present.

For example, what transportation modes does the post emphasize? Do they mention Section 8 vouchers? Some of this might be exploratory—we can see what type of language is included, and then parse in a more structured way (e.g. distinguishing between "No Section 8" and "Section 8 welcome").

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that will return True if Section 8 is mentioned, otherwise False.
</div>

*Hint*: the `in` operator is a simple way to do this. For example:

In [10]:
'plan' in 'urban planning'

True

In [11]:
'plan' in 'Urban Planning' 

False

In [18]:
# your code here to return Section 8 information
'Section 8' in r.text

True
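
Because `in` is case-sensitive, a function version might use a regular expression instead. The sketch below is one possible approach, not the only answer to the exercise; `mentions_section8` and `rejects_section8` are hypothetical names, and the second function is a rough heuristic that only looks for "no" directly before "Section 8".

```python
import re

def mentions_section8(text):
    """Return True if 'Section 8' appears in the text, ignoring case and spacing."""
    return re.search(r'section\s*8', text, flags=re.IGNORECASE) is not None

def rejects_section8(text):
    """Rough heuristic (an assumption): 'no' directly before 'Section 8'."""
    return re.search(r'\bno\s+section\s*8', text, flags=re.IGNORECASE) is not None

print(mentions_section8('SECTION 8 welcome'))
print(mentions_section8('Close to the metro'))
print(rejects_section8('Sorry, no Section 8'))
```

With the post downloaded into `r`, this would be called as `mentions_section8(r.text)`.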

Most of the post is free-form text. So there's not going to be much value added by `BeautifulSoup`.

The exceptions are (i) parking, and (ii) the geographic coordinates.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that returns True if the apartment has no parking, along with the lat/lon of the apartment.
</div>

*Hint*: First, create a `soup` object. Then, look and see what tag and class encloses this information. Then, you can experiment with `find` and `find_all` with this tag and class.

In [22]:
# your code here
import requests
from bs4 import BeautifulSoup

# The Craigslist search page returns HTML rather than JSON, so passing it to
# json.loads raises a JSONDecodeError. The dataframe of posts built above
# already holds the url of each post, so we reuse that instead.
df = df.head(5) # keep the first 5 rows, so we don't overload the Craigslist website

In [None]:
# get an example link (the Craigslist dataframe stores it in the url column)
posturl = df.loc[0,'url']
print(posturl)

# request that page and get the soup object
r = requests.get(posturl)
soup = BeautifulSoup(r.text)
print(soup.prettify())
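
Once the tag and class are known, the function could look something like the sketch below. The tag names, classes and attributes in it (`p` with class `attrgroup`, a `div` with id `map` carrying `data-latitude` and `data-longitude`) are assumptions about how a post page might be laid out, so check them against the `prettify()` output before relying on them. The `html` string is made up, and `get_parking_latlon` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up HTML. The tags, classes and attributes are ASSUMPTIONS about the
# page layout; compare them with soup.prettify() for a real post.
html = """
<div id="map" data-latitude="34.05" data-longitude="-118.25"></div>
<p class="attrgroup"><span>apartment</span><span>no parking</span></p>
"""

def get_parking_latlon(page_html):
    """Return (no_parking, lat, lon) for one post page, under the assumptions above."""
    page = BeautifulSoup(page_html, 'html.parser')

    # collect the text of every attribute span, lower-cased for comparison
    attrs = [span.text.strip().lower()
             for group in page.find_all('p', class_='attrgroup')
             for span in group.find_all('span')]
    no_parking = 'no parking' in attrs

    # the coordinates are assumed to be attributes of the map div
    map_div = page.find('div', id='map')
    if map_div is None:
        return no_parking, None, None
    lat = map_div.get('data-latitude')
    lon = map_div.get('data-longitude')
    return (no_parking,
            None if lat is None else float(lat),
            None if lon is None else float(lon))

print(get_parking_latlon(html))
```

Returning `None` for missing coordinates keeps the function from raising an error on posts that have no map.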

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Apply this function to your dataframe, and create new columns for parking, lat, and lon.
</div>

In [None]:
# your code here
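
When a function returns several values, one way to spread them across new columns is to have it return a dictionary and build a dataframe from the results. This sketch uses a toy dataframe and a stand-in function, `fake_details`, that returns fixed made-up values instead of scraping anything.

```python
import pandas as pd

def fake_details(url):
    """Stand-in (hypothetical) for a scraping function; returns fixed values."""
    return {'parking': True, 'lat': 34.05, 'lon': -118.25}

demo = pd.DataFrame({'url': ['https://example.com/a', 'https://example.com/b']})

# apply gives one dict per row; a list of dicts becomes a dataframe of columns
details = pd.DataFrame(demo['url'].apply(fake_details).tolist(), index=demo.index)
demo = pd.concat([demo, details], axis=1)
print(demo.columns.tolist())
```

Passing `index=demo.index` keeps the new columns aligned with the original rows even if the dataframe has been filtered or reordered.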

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain confidence in experimenting with code - exploring different objects, writing functions, and so on</li>
  <li>Learn how to extract information from a scraped webpage - how to do the detective work.</li>
  <li>Gain confidence in debugging errors.</li>
</ul>
</div>