This was a challenging exercise to complete because websites structure their HTML in such diverse ways. I was able to scrape the Seattle Parks website for the relevant information, but when the scraped addresses were passed to the Google Maps API, about 15% of them came back with lat/lng coordinates in the Midwest or on the East Coast.
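
A minimal sketch of one way to flag results like these, assuming an approximate bounding box for Seattle. The box values and the helper name `in_bbox` are assumptions made for illustration and are not part of the original code.

```python
# Approximate bounding box for Seattle (assumed values, adjust as needed).
SEATTLE_BBOX = {"south": 47.4, "north": 47.8, "west": -122.5, "east": -122.2}


def in_bbox(lat, lng, bbox=SEATTLE_BBOX):
    """Return True if the point lies inside the bounding box."""
    return (bbox["south"] <= lat <= bbox["north"]
            and bbox["west"] <= lng <= bbox["east"])


# A point near downtown Seattle passes; a point in Ohio does not.
print(in_bbox(47.61, -122.33))
print(in_bbox(39.96, -83.00))
```

Rows that fail a check like this could be written out separately and corrected by hand rather than silently mapped.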

In [5]:
#Import all of the necessary modules for the code

from bs4 import BeautifulSoup
import urllib2
import re
import lxml
import csv
import json
from time import sleep

#set some variables for the url to be scraped and location of files
url = "http://www.seattle.gov/parks/listall.asp"
directory = "U:/Seattle_parks.csv"
geoDirectory = "U:/Seattle_parks_geocode.csv"

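#define the geocoding function using the google maps API
import urllib

def geocode(address):
    # quote_plus escapes spaces as "+" and also escapes characters such as "&",
    # which would otherwise split intersection addresses into extra parameters
    url = ("http://maps.googleapis.com/maps/api/geocode/json?"
        "sensor=false&address={0}".format(urllib.quote_plus(address)))
    # urlopen must be called through the urllib2 module imported above
    return json.loads(urllib2.urlopen(url).read())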

#Write a csv file with the names and addresses of each Seattle Park
# using beautiful soup
with open(directory, 'wb') as f:
    fieldnames = ("Park_Name", "Address")
    output = csv.writer(f, delimiter=",")
    output.writerow(fieldnames)
    
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read(), "lxml")
    
    names = soup.find_all('a', href=re.compile('park_detail'))
    names = [name.contents[0].strip() for name in names]

    places = soup.find_all("td", {"width" : "43%"})
    places = [place.contents[0].strip() for place in places]
    
    for n,p in zip(names, places):
        output.writerow((n, p))
        
print "Done writing csv file"

Done writing csv file


Some of the following code was borrowed from other Google users who were attempting to solve similar problems, and I adapted their lines of code to my own situation. One issue I ran into here was the need to call the sleep function near the bottom of the loop. Without that pause, my code output a file in which only the first dozen or so records had lat/lng information and the rest were empty, most likely because the API limits how quickly requests can be made.

In [6]:
with open(directory, 'r') as f:
    reader = csv.DictReader(f, delimiter=',')
    
    with open(geoDirectory, 'wb') as w:
        fields = ["Park_Name", "Address", "lat", "lng"]
        writer = csv.DictWriter(w, fieldnames=fields, delimiter=',')
        writer.writeheader()

        for line in reader:
            print "Geocoding: {0}".format(line["Address"])
            response = geocode(line["Address"] + " Seattle WA")
            if response["status"] == u"OK":
                results = response.get("results")[0]
                line["Address"] = results["formatted_address"]
                line["lat"] = results["geometry"]["location"]["lat"]
                line["lng"] = results["geometry"]["location"]["lng"]
            else:
                line["Address"] = ""
                line["lat"] = ""
                line["lng"] = ""
                print "No API info"
            sleep(.5)
            writer.writerow(line)

print "Done writing csv file"

Geocoding: 564 12th Ave
Geocoding: 4400 14th Ave NW
Geocoding: 3001 E Madison St
Geocoding: 32nd Ave W
Geocoding: 606  NW 76th St
Geocoding: 723 N 35th St
Geocoding: Lake Washington Blvd S & S Adams St
Geocoding: 12526 27th Ave NE
Geocoding: 1702 Alki Ave SW
Geocoding: 5817 SW Lander St
Geocoding: 1504 34TH Ave
Geocoding: 2000 Martin Luther King Jr Way S
Geocoding: 4000 Beach Dr SW
Geocoding: 3431 Arapahoe Pl W
Geocoding: 4120 Arroyo Dr SW
Geocoding: 8702 Seward Park Ave S
Geocoding: 1501 21st Ave S
Geocoding: 4020 Fremont Ave N
Geocoding: 2548 Delmar Dr E
Geocoding: 8347 14th Ave NW
Geocoding: 5701 22nd Ave NW
Geocoding: 1702 nw 62nd St
Geocoding: 2644 NW 60th St
Geocoding: 7802 Banner Way NE
Geocoding: 6425 SW Admiral Way
Geocoding: 2614 24th Ave W
Geocoding: 3rd Ave W & W Prospect St
Geocoding: 1902 13th Ave S
Geocoding: 1110 S Dearborn St
Geocoding: 5th Ave NE & NE 103rd St
Geocoding: 5809 15th Ave NE
Geocoding: 8650 55th Ave S
Geocoding: 1st to 5th Ave on Bell St
Geocoding: Bellev
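
The fixed half-second pause in the loop above could be replaced by a retry that only waits when the API reports a rate limit. The following is a sketch under assumptions, not the code that was run: the helper name `geocode_with_retry` and its parameters are hypothetical, and it assumes the response carries the `OVER_QUERY_LIMIT` status when requests arrive too quickly.

```python
from time import sleep


def geocode_with_retry(fetch, address, max_tries=4, base_delay=0.5, pause=sleep):
    """Call fetch(address), retrying with a doubling delay on rate limits.

    fetch is any function that returns a dict with a "status" key, such as
    the geocode() function defined earlier.
    """
    response = {"status": "NOT_ATTEMPTED"}
    for attempt in range(max_tries):
        response = fetch(address)
        # Only a rate-limit status is worth retrying; other statuses
        # (for example ZERO_RESULTS) will not change on a second call.
        if response.get("status") != "OVER_QUERY_LIMIT":
            return response
        # Wait before the next try, but not after the final one.
        if attempt < max_tries - 1:
            pause(base_delay * (2 ** attempt))
    return response
```

In the loop above this would be called as `geocode_with_retry(geocode, line["Address"] + " Seattle WA")` in place of the direct `geocode` call.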

I tried to get a shapefile output in two different ways. The first was to use ArcMap tools. The 'MakeXYEventLayer' and 'FeatureClassToShapefile' tools were able to produce a shapefile with the corresponding lat/lng coordinates, but the file was not spatially referenced appropriately. I was unable to determine a fix for this, so I tried a second method using shapely, which was linked to in the assignment literature. I encountered the same issue there.
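
One common cause of points landing in the wrong place is swapped axis order, since GIS tools generally expect x to be longitude and y to be latitude. The sketch below is an illustrative check only; the helper name `to_xy` is hypothetical, and it may not be the cause of the problem described above.

```python
def to_xy(lat, lng):
    """Return (x, y) = (longitude, latitude) after validating the ranges."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude out of range, are lat/lng swapped? %r" % (lat,))
    if not -180.0 <= lng <= 180.0:
        raise ValueError("longitude out of range: %r" % (lng,))
    return (lng, lat)


# Seattle is near latitude 47.6, longitude -122.3.
print(to_xy(47.61, -122.33))
```

Because Seattle's longitude is below -90, passing the two values in the wrong order raises an error instead of quietly producing a misplaced point.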

In [183]:
import sys
sys.path.append('C:\\Program Files (x86)\\ArcGIS\\Desktop10.3\\bin')
sys.path.append('C:\\Program Files (x86)\\ArcGIS\\Desktop10.3\\arcpy')
sys.path.append('C:\\Program Files (x86)\\ArcGIS\\Desktop10.3\\ArcToolbox\\Scripts')
import arcpy

#set environment workspace and overwrite functions
arcpy.env.workspace = "U:/"
arcpy.env.overwriteOutput = True

latlngTab = "U:/GEOG458/Web scraper/Seattle_parks_geocode.csv"
sr = arcpy.SpatialReference(4326)

#MakeXYEventLayer takes the x field (longitude) before the y field (latitude)
arcpy.MakeXYEventLayer_management(latlngTab,"lng","lat","temp.lyr",sr)
arcpy.FeatureClassToShapefile_conversion('temp.lyr',"U:/")

print "conversion complete"


conversion complete


In [201]:
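import csv
from shapely.geometry import Point, mapping
from fiona import collection
from fiona.crs import from_epsg

schema = { 'geometry': 'Point', 'properties': { 'Park_Name': 'str' , 'Address': 'str'} }
# pass crs so the shapefile gets a .prj file declaring WGS84 (EPSG:4326)
with collection(
    "Seattle_Parks.shp", "w", "ESRI Shapefile", schema, crs=from_epsg(4326)) as output:
    with open(geoDirectory, 'rb') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # skip parks the geocoder could not resolve (empty lat/lng)
            if not row['lat'] or not row['lng']:
                continue
            point = Point(float(row['lng']), float(row['lat']))
            output.write({
                'properties': {
                    # the csv header written earlier is 'Park_Name'
                    'Park_Name': row['Park_Name'],
                    'Address': row['Address']
                },
                'geometry': mapping(point)
            })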
    

While running short on time, I tried to use shapely/fiona to export the same information to a different file format (e.g., GeoJSON) but encountered an error.

In [None]:
import csv
from shapely.geometry import Point, mapping
from fiona import collection

schema = { 'geometry': 'Point', 'properties': { 'Park_Name': 'str' , 'Address': 'str'} }
with collection(
    "Seattle_Parks.geojson", "w", "GeoJSON", schema) as output:
    with open(geoDirectory, 'rb') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # skip parks the geocoder could not resolve (empty lat/lng)
            if not row['lat'] or not row['lng']:
                continue
            point = Point(float(row['lng']), float(row['lat']))
            output.write({
                'properties': {
                    # the csv header written earlier is 'Park_Name'
                    'Park_Name': row['Park_Name'],
                    'Address': row['Address']
                },
                'geometry': mapping(point)
            })
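
If the fiona GeoJSON driver keeps failing, the same export can be sketched with only the standard `json` module. This is an alternative under assumptions, not the method from the assignment: the helper name `rows_to_geojson` and the sample rows are hypothetical, and the rows are assumed to have the `Park_Name`, `Address`, `lat` and `lng` columns written by the geocoding step.

```python
import json


def rows_to_geojson(rows):
    """Build a GeoJSON FeatureCollection dict from geocoded rows.

    Rows with empty lat/lng (addresses the geocoder could not resolve)
    are skipped.
    """
    features = []
    for row in rows:
        if not row.get("lat") or not row.get("lng"):
            continue
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON positions are [longitude, latitude].
                "coordinates": [float(row["lng"]), float(row["lat"])],
            },
            "properties": {
                "Park_Name": row["Park_Name"],
                "Address": row["Address"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample rows standing in for csv.DictReader output.
sample = [
    {"Park_Name": "Example Park", "Address": "1 Example St",
     "lat": "47.61", "lng": "-122.33"},
    {"Park_Name": "Unresolved Park", "Address": "", "lat": "", "lng": ""},
]
print(json.dumps(rows_to_geojson(sample), indent=2))
```

Passing a `csv.DictReader` over the geocoded file in place of `sample`, and writing the result with `json.dump`, would produce a file that most GIS tools read as WGS84 points.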

