# Plot multiple addresses on Google Maps

This notebook shows how to take a list of addresses, clean them up a bit and then plot them all on Google Maps, so that they can be visualised at the same time. As the Google Maps API will be used, the first goal is to get the list of addresses into a form that can simply be passed to that API.

For some sample data, I took the addresses of some colleges in Bavaria. I got this list by simply copying the addresses from the following website: **[deutsch-werden.de](http://www.deutsch-werden.de/alle-volkshochschulen-deutschland-liste-vhs-adressen-kontaktdaten-fuer-einbuergerungstest-kurse "Have a look at the data on the website")** and pasting them into a file called `raw_addresses.txt`.

In [1]:
import csv
import itertools
import json
import gmplot as gm
import geocoder

## 1 - Preparing the data

The first few addresses from the file `raw_addresses.txt` are shown below:

In [2]:
! head -n 13 raw_addresses.txt

Bayerischer Volkshochschulverband
Fäustlestraße 5a
80339 München
T: 089 5027081
F: 089 5027082
bvv@vhs-bayern.de

Volkshochschule Abensberg
Stadtplatz 1
93326 Abensberg
Telefon: 09443 90105
F: 09443 90106
info@vhs-abensberg.de


---------------------

Clearly this is not a nice form to pass to Google, with phone numbers, email addresses and the address spread over many lines. Let's clean them up...

Define a function that works through the file of raw addresses. It does a few things:

1. The `try`/`except` block itself catches the empty lines between individual addresses and simply inserts a placeholder that lets us clearly separate the addresses.
2. In the `try` block, the first `if` statement lets us ignore lines that start with `T:` or `F:`, meaning the line holds the telephone or fax number, respectively.
3. The next `elif` statement skips lines that contain more than five digits. German postal codes contain five digits - any more and it must be a phone number _(in the context of this data set)_.
4. The last `elif` statement leaves out lines that contain the `@` symbol - so basically the lines with email addresses.
5. If none of these conditions are met, then the line is assumed to be useful and we append it to a list containing lines to keep.
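
To make the rules above easier to check in isolation, here is a minimal sketch that applies them to one line at a time. The name `classify_line` and the labels it returns are made up for this illustration, and it works on plain strings rather than on the rows produced by `csv.reader`, so it is not a drop-in replacement for the function used in this notebook.

```python
def classify_line(line):
    """Return 'blank', 'phone', 'email' or 'keep' for one raw text line.

    Hypothetical helper for illustration only.
    """
    text = line.strip()
    if not text:
        return 'blank'
    # lines starting with T: or F: hold a telephone or fax number
    if text.startswith(('T:', 'F:')):
        return 'phone'
    # more than five digits: too many for a postal code, so treat as a phone number
    if sum(char.isnumeric() for char in text) > 5:
        return 'phone'
    # assumes names and addresses do not contain the @ symbol
    if '@' in text:
        return 'email'
    return 'keep'


# the second address block from the file preview above
sample = [
    'Volkshochschule Abensberg',
    'Stadtplatz 1',
    '93326 Abensberg',
    'Telefon: 09443 90105',
    'F: 09443 90106',
    'info@vhs-abensberg.de',
    '',
]
labels = [classify_line(line) for line in sample]
print(labels)
```

The first three lines come back as `'keep'`, the two number lines as `'phone'`, followed by `'email'` and `'blank'`.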

In [3]:
def extract_addresses(file, fill_empty_line='NEXT_ADDRESS', verbose=True):
    """get the addresses of all the places
    
    file:             the path to the file with addresses
    fill_empty_line:  a string to simply use as a placeholder for empty lines between addresses
    verbose:          if True, print a message for each line that is skipped
    
    returns:          the lines to keep (each one a list, as read by csv.reader) in a list
    """

    with open(file, 'r') as f:

        file_reader = csv.reader(f)

        lines_to_keep = []

        for i, line in enumerate(file_reader):

            try: # two checks to remove telephone/fax numbers
                if line[0][0:2] == 'T:' or line[0][0:2] == 'F:':
                    if verbose:
                        print('skipped a T:/F: number')
                    pass                       

                elif sum(char.isnumeric() for char in line[0]) > 5:
                    if verbose:
                        print('skipped a telephone/fax number')
                    pass

                # skip any lines containing an email address
                # assumes names and addresses do not contain the @ symbol
                elif '@' in line[0]:
                    if verbose:
                        print('removed email address: {}'.format(line))

                else:
                    lines_to_keep.append(line)

            except IndexError: # catch the empty lines between each address
                lines_to_keep.append([fill_empty_line])
            
    return lines_to_keep

In [4]:
addresses = extract_addresses('./raw_addresses.txt', verbose=False)

In [5]:
# create a dictionary to hold all addresses; the lats and longs are appended later
address_book = {}

# separate the addresses, using the default 'NEXT_ADDRESS' lines added in place of empty lines above
tmp = []
counter = 1
for line in addresses:
    if line == ['NEXT_ADDRESS']:
        chunk_addr = ', '.join([l[0] for l in tmp])
        address_book[str(counter)] = chunk_addr
        tmp = []
        counter += 1
    else:
        tmp.append(line)

# keep the last address too, in case the file does not end with an empty line
if tmp:
    address_book[str(counter)] = ', '.join([l[0] for l in tmp])

In [6]:
for k, v in address_book.items():
    if int(k) <= 5:
        print('Address {}: {}'.format(k, v))

Address 1: Bayerischer Volkshochschulverband, Fäustlestraße 5a, 80339 München
Address 2: Volkshochschule Abensberg, Stadtplatz 1, 93326 Abensberg
Address 3: Volkshochschule Landkreis Aichach-Friedberg, Steubstr. 3, 86551 Aichach
Address 4: Volkshochschule Alt-Neuötting - Töging am Inn, Neuöttinger Str. 32, 84503 Altötting
Address 5: Volkshochschule Amberg, Zeughausstr. 1a, 92224 Amberg


These look pretty good - the sort of thing you could use to search in Google Maps!
Now have a look at all of the addresses to see if there are any patterns missed in the cleanup statements of the `extract_addresses` function.

In [7]:
for k, v in address_book.items():
    print('Address {}: {}'.format(k, v))

Address 1: Bayerischer Volkshochschulverband, Fäustlestraße 5a, 80339 München
Address 2: Volkshochschule Abensberg, Stadtplatz 1, 93326 Abensberg
Address 3: Volkshochschule Landkreis Aichach-Friedberg, Steubstr. 3, 86551 Aichach
Address 4: Volkshochschule Alt-Neuötting - Töging am Inn, Neuöttinger Str. 32, 84503 Altötting
Address 5: Volkshochschule Amberg, Zeughausstr. 1a, 92224 Amberg
Address 6: Volkshochschule Amberg-Sulzbach, Obere Gartenstraße 3, 92237 Sulzbach-Rosenberg
Address 7: Volkshochschule Ammersee-Nordwest, Schulweg 2, 86919 Utting a. Ammersee
Address 8: Volkshochschule Ansbach, Martin-Luther-Platz 1, 91522 Ansbach
Address 9: Volkshochschule Ansbach Landkreis, Crailsheimstraße 1, 91522 Ansbach
Address 10: Volkshochschule Arnstorf, Unterer Markt 12, 94424 Arnstorf
Address 11: Volkshochschule Aschaffenburg, Luitpoldstr. 2, 63739 Aschaffenburg
Address 12: Volkshochschule Augsburger Land e.V., Holbeinstraße 12, 86150 Augsburg
Address 13: Volkshochschule Augsburg Stadt, Willy-B

A couple of entries look strange: two addresses are empty (printed as "Address 24" and "Address 154"). This can be explained by having a look at the input file, `raw_addresses.txt`.

<img src="empty_line.png" width="350" height="150">

It is clear that there is an unexpected empty line between the main address and the email address of the college. We could adjust our function above to cover this issue, but it is better to be pragmatic here and simply remove the line in the input data.
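
For reference, handling this in code would not take much. The sketch below is one possible approach and not what this notebook does: it groups consecutive non-empty lines with `itertools.groupby`, so that any run of empty lines counts as a single separator. The function name `group_addresses` and the sample lines are made up, and the input is assumed to be plain strings with phone and email lines already filtered out.

```python
from itertools import groupby


def group_addresses(lines):
    """Join each run of consecutive non-empty lines into one address string.

    Hypothetical helper for illustration only.
    """
    groups = []
    # groupby yields runs of lines that share the same key; the key here is
    # simply whether the stripped line contains any text
    for has_text, run in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            groups.append(', '.join(line.strip() for line in run))
    return groups


# made-up addresses; the email line between the two blanks was already removed
cleaned = [
    'Volkshochschule A', 'Musterweg 1', '12345 Musterstadt',
    '',  # the unexpected extra empty line
    '',  # the regular separator
    'Volkshochschule B', 'Beispielstr. 2', '54321 Beispielstadt',
]
grouped = group_addresses(cleaned)
print(grouped)
```

Because both empty lines fall into the same run, no empty address is created between the two entries.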

After removing that line, rerun the code above and save the dictionary of addresses in JSON format, which is nice and clean for a dictionary, as the two have pretty much the same structure.

In [8]:
with open('address_book.json', 'w') as target:
    json.dump(address_book, target)
    
# we can read this back into a python dictionary using:
with open('address_book.json', 'r') as source:
    reloaded = json.load(source)

In [9]:
# Check that the two are identical
address_book == reloaded

True

## 2 - Retrieving latitudes and longitudes

Now we can use the dictionary of addresses to create plots. To use the `gmplot` package, however, we first need to convert our addresses above into latitude-longitude coordinates. We can do this with another great package called [`geocoder`](https://github.com/DenisCarriere/geocoder "Go to the GitHub page").

Reading the documentation of [`gmplot`](https://pypi.python.org/pypi/gmplot/1.0.5 "Read the docs"), it seems we can use a scatter plot or heatmap to show many results nicely on a single map. For this, we need a list of latitudes and a list of longitudes. Let's create these using the address book and the `geocoder` package.

In [8]:
addr_list = list(address_book.values())

lattitudes = []
longitudes = []

for addr in addr_list:
    coords = geocoder.google(addr)
    lattitudes.append(coords.lat)
    longitudes.append(coords.lng)

In [9]:
# have a peek at the lattitudes
lattitudes

[48.1388329,
 None,
 48.45919500000001,
 48.229024,
 49.4438699,
 49.50392369999999,
 48.02702,
 49.3026819,
 49.3026819,
 48.56262,
 49.975957,
 None,
 48.36532,
 47.85574,
 50.2016628,
 50.3234946,
 47.72318139999999,
 47.76026,
 49.8990318,
 49.89620439999999,
 50.164226,
 None,
 None,
 49.03433,
 47.63439,
 47.82863,
 None,
 None,
 None,
 49.8919242,
 48.16936159999999,
 49.4592152,
 49.2262189,
 47.85175,
 50.26074440000001,
 48.35026,
 48.25499629999999,
 48.83679,
 47.9487248,
 48.56975569999999,
 None,
 48.7201124,
 None,
 None,
 48.40137,
 None,
 None,
 49.80292,
 None,
 None,
 49.7199885,
 47.8420626,
 48.40301969999999,
 48.8561633,
 48.1788344,
 49.4746854,
 47.56991,
 None,
 None,
 49.9002,
 None,
 48.19359,
 None,
 48.454779,
 49.11590340000001,
 48.10722,
 50.114788,
 50.03565390000001,
 50.0303146,
 47.99777,
 49.51061,
 49.568087,
 None,
 50.31054810000001,
 50.31926,
 48.76391530000001,
 50.0750904,
 50.0677574,
 48.22713,
 49.96232,
 47.88149139999999,
 48.0892875,
 

In [10]:
# have a peek at the longitudes
longitudes

[11.5390872,
 None,
 11.129734,
 12.6796216,
 11.8583805,
 11.7471154,
 11.08919,
 10.5722592,
 10.5722592,
 12.81616,
 9.144433,
 None,
 10.90461,
 12.00442,
 10.0753922,
 10.2158824,
 12.8770748,
 11.5625,
 10.8996666,
 10.8930328,
 11.075237,
 None,
 None,
 11.47062,
 13.0035,
 10.64204,
 None,
 None,
 None,
 9.589843,
 12.8313118,
 10.8563626,
 12.6574303,
 12.34756,
 10.9621527,
 11.37745,
 11.4381907,
 12.9623699,
 11.1035164,
 10.5234843,
 None,
 10.7798867,
 None,
 None,
 12.76142,
 None,
 None,
 9.1569,
 None,
 None,
 11.0598739,
 12.977001,
 11.7444365,
 13.3846951,
 11.2577203,
 10.9874951,
 10.70083,
 None,
 None,
 10.35001,
 None,
 11.37079,
 None,
 10.27904,
 10.7552183,
 11.72421,
 9.890392300000002,
 10.5149789,
 10.5038115,
 11.17485,
 11.43285,
 10.881424,
 None,
 11.9219963,
 11.9136101,
 11.4279717,
 9.004652499999999,
 9.1653878,
 11.47781,
 9.766375199999999,
 10.6233135,
 10.8524753,
 10.311323,
 10.1621592,
 12.0690283,
 10.56626,
 11.3257,
 10.3591287,
 11.4581

There are some missing values - I assume it is because Google was unable to find those locations. We can check how many of these cases there are and possibly just remove them if there aren't many. An alternative would be to clean up the addresses again so that Google has a higher success rate at returning the coordinates.
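
As a rough idea of what such a clean-up could look like, the sketch below retries a failed lookup without the name of the college, keeping only the street and the postal code. Everything in it is an assumption made for illustration: `geocode_with_fallback` is a made-up name, `lookup` stands for any function that returns a `(lat, lng)` tuple or `None`, and a small dictionary with placeholder coordinates takes the place of the real service. Whether this actually improves the success rate with Google has not been tested here.

```python
def geocode_with_fallback(address, lookup):
    """Try the full address first, then the address without its first part.

    lookup: any function mapping an address string to a (lat, lng) tuple,
            or to None when nothing was found (hypothetical interface).
    """
    result = lookup(address)
    if result is not None:
        return result
    # drop the college name (everything before the first comma) and retry
    parts = [part.strip() for part in address.split(',')]
    if len(parts) > 1:
        return lookup(', '.join(parts[1:]))
    return None


# a fake lookup table standing in for the geocoding service;
# the coordinates are placeholders, not real results
fake_results = {'Stadtplatz 1, 93326 Abensberg': (1.0, 2.0)}
fake_lookup = fake_results.get  # dict.get returns None for unknown keys

coords = geocode_with_fallback(
    'Volkshochschule Abensberg, Stadtplatz 1, 93326 Abensberg', fake_lookup)
print(coords)
```

To use this with `geocoder`, `lookup` would need to be a small wrapper around `geocoder.google` that returns `None` whenever `coords.lat` is `None`.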

In [11]:
count = 0

for i, j in zip(lattitudes, longitudes):
    if i == j == None:
        count += 1
    
print('Number of failed addresses:    {}'.format(count))
print('Number of fails as percentage: {:.1f}%'.format(count / len(lattitudes) * 100))

Number of failed addresses:    24
Number of fails as percentage: 13.3%


Let's just remove the missing pairs (**None** values), then check it worked as expected by running the same code from above to count the None values in the lists.
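
One thing to watch out for when removing items from a Python list: deleting inside a loop over that same list shifts the remaining items forward, so the item directly after each deletion is never checked. The sketch below uses made-up values to show the effect next to a safer alternative, which builds new lists from the pairs worth keeping.

```python
# made-up values with two consecutive missing pairs
lats = [48.1, None, None, 49.4]
lngs = [11.5, None, None, 11.8]

# Deleting while iterating: once index 1 is deleted, the second None moves
# into index 1, but the loop carries on at index 2 and never sees it.
bad_lats, bad_lngs = list(lats), list(lngs)
for i, (lat, lng) in enumerate(zip(bad_lats, bad_lngs)):
    if lat is None and lng is None:
        del bad_lats[i]
        del bad_lngs[i]

# Filtering into new lists: every pair is inspected exactly once.
pairs = [(lat, lng) for lat, lng in zip(lats, lngs)
         if lat is not None and lng is not None]
good_lats = [lat for lat, _ in pairs]
good_lngs = [lng for _, lng in pairs]

print(bad_lats)   # one None is left behind
print(good_lats)  # no None values remain
```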

In [12]:
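# build new lists rather than deleting inside the loop: deleting while
# iterating shifts the remaining items, so some None values would be skipped
valid_pairs = [(lat, lng) for lat, lng in zip(lattitudes, longitudes)
               if lat is not None and lng is not None]
lattitudes = [lat for lat, lng in valid_pairs]
longitudes = [lng for lat, lng in valid_pairs]
        
count = 0

for i, j in zip(lattitudes, longitudes):
    if i == j == None:
        count += 1
    
print('Number of failed addresses:    {}'.format(count))
print('Number of fails as percentage: {:.1f}%'.format(count / len(lattitudes) * 100))

Number of failed addresses:    0
Number of fails as percentage: 0.0%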


Let's append the latitudes and longitudes to the address book and save it, overwriting the first save from earlier on...

In [13]:
address_book['lattitudes'] = lattitudes
address_book['longitudes'] = longitudes

with open('address_book.json', 'w') as target:
    json.dump(address_book, target)

## 3 - Plotting the maps

Now that we have the latitudes and longitudes in lists, we can plot them using the `gmplot` package...

In [15]:
gmap = gm.GoogleMapPlotter(lattitudes[0], longitudes[0], zoom=8)
gmap.scatter(lattitudes[1:], longitudes[1:], color='r', marker=True)

gmap.draw('addresses_on_map.html')
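
The map above is centred on the first address in the list. A possible alternative, sketched below with made-up coordinates, is to centre it on the mean of all coordinates. The function name `map_centre` is hypothetical, and a plain arithmetic mean is only a reasonable approximation for a fairly small region such as Bavaria.

```python
def map_centre(lats, lngs):
    """Return the mean latitude and longitude, ignoring None entries.

    Hypothetical helper for illustration only.
    """
    pairs = [(lat, lng) for lat, lng in zip(lats, lngs)
             if lat is not None and lng is not None]
    if not pairs:
        raise ValueError('no valid coordinates to average')
    centre_lat = sum(lat for lat, _ in pairs) / len(pairs)
    centre_lng = sum(lng for _, lng in pairs) / len(pairs)
    return centre_lat, centre_lng


# made-up coordinates, including one missing pair
centre = map_centre([48.0, None, 50.0], [11.0, None, 12.0])
print(centre)
```

The two values could then be passed to `gm.GoogleMapPlotter` in place of `lattitudes[0]` and `longitudes[0]`, which would also allow every address to be plotted rather than only `lattitudes[1:]`.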

Now we just need to open the final HTML file to see the addresses on the map!

On a Linux system, the following should work...

In [16]:
! sensible-browser addresses_on_map.html

[5539:5573:0617/110155.857705:ERROR:browser_gpu_channel_host_factory.cc(103)] Failed to launch GPU process.
Created new window in existing browser session.


The resulting HTML page is fully interactive (you can zoom and pan) and looks like this:

<img src="addresses_on_map_redpin.png" width="800" height="800">

We can also try the **heatmap** for the fun of it...

In [17]:
gmap_heatmap = gm.GoogleMapPlotter(lattitudes[0], longitudes[0], zoom=8)
gmap_heatmap.heatmap(lattitudes[1:], longitudes[1:], threshold=1, radius=15, opacity=1)

gmap_heatmap.draw('addresses_as_heatmap.html')

In [18]:
! sensible-browser addresses_as_heatmap.html

[5745:5779:0617/110555.125655:ERROR:browser_gpu_channel_host_factory.cc(103)] Failed to launch GPU process.
Created new window in existing browser session.


<img src="addresses_on_map_heatmap.png" width="800" height="800">