# Geocoding (Mapping your Address)

![gcod](images/gcod1.jpg)

## Where do you find Addresses?

1. When you are posting something.

![gcod](images/gcod2.jpg)

2. When you are using GPS (Driving to a new place)

![gcod3](images/gcod3.jpg)

3. When you are using Google Maps

![gcod4](images/gcod4.png)

4. When you are using social media platforms (Twitter, Facebook)

![gcod5](images/gcod5.jpg)

5. If you work in a hospital facility, you might have interacted with the electronic health record (EHR) system

![gcod6](images/gcod6.jpg)

6. You have probably seen addresses in flyers

![gcod7](images/gcod7.jpg)

7. Or even in newspapers

![gcod8](images/gcod8.png)

Can you think of other sources of addresses that you have encountered before?

## Are Addresses directly Mappable?

Let's try out a dataset.

In [2]:
import pandas as pd
import geopandas as gpd

In [5]:
addresses = pd.read_csv(r'data/Address_Zip_44106_small.csv')
addresses

|  | id | Address | CITY | REGION | POSTCODE |
|---|---|---|---|---|---|
| 0 | 15850 | 1972 E 120 ST | CLEVELAND | OH | 44106 |
| 1 | 31822 | 2200 DELAWARE DR | CLEVELAND HEIGHTS | OH | 44106 |
| 2 | 47687 | 1937 E 120 ST | CLEVELAND | OH | 44106 |
| 3 | 47688 | 1961 E 120 ST | CLEVELAND | OH | 44106 |
| 4 | 65312 | 1960 DENTON DR | CLEVELAND HEIGHTS | OH | 44106 |
| ... | ... | ... | ... | ... | ... |
| 94 | 172382 | 1531 E 118 ST | CLEVELAND | OH | 44106 |
| 95 | 172383 | 1527 E 118 ST | CLEVELAND | OH | 44106 |
| 96 | 172384 | 1515 E 118 ST | CLEVELAND | OH | 44106 |
| 97 | 172386 | 10900 EUCLID AVE | CLEVELAND | OH | 44106 |


We can't convert this into a GeoDataFrame yet, because there is no column that can be used to create geometry.

So how do we map these addresses? (Hint: the name of this chapter.)

## What is geocoding?

>Geocoding is the process of **transforming a description of a location—such as a pair of coordinates, an address, or a name of a place—to a location on the earth's surface**. 

OK, let's look at a simpler definition.

> Geocoding takes an **address as input, then translates it to a location on a map**. In short, it **changes an address to lat long coordinates (latitude and longitude)**.

![gcod](images/gcod9.png)

### What are geocodes?

> Geocodes are a **set of latitude and longitude coordinates of a physical location**.

### Types of Geographic Location Descriptions that can be geocoded

1. Addresses

This is the most common source of input for geocoding. 

12471 Cedar Rd, Cleveland Heights, Ohio 44106

2. Place Names

Place names are hard to resolve and hence hard to geocode. Many geocoders do not handle place names well. However, geocoding services provided by vendors such as Google, Bing, and Baidu are very good at handling place names.

The Eiffel Tower

Now let's get our hands dirty!

### Geocoding using Geopandas

GeoPandas supports geocoding through the geopy library (https://geopy.readthedocs.io/en/stable/). Let us geocode our addresses dataset using GeoPandas.

In [7]:
addresses['fullAddress'] = addresses.Address+' '+addresses.CITY+' '+addresses.REGION+' '+addresses.POSTCODE

TypeError: can only concatenate str (not "int") to str

We are trying to concatenate several columns into a single column, but it seems that at least one of them has an integer data type.

In [8]:
addresses.dtypes

id           int64
Address     object
CITY        object
REGION      object
POSTCODE     int64
dtype: object

As you can see, the POSTCODE column has an integer data type (int64), so we cast it to a string before concatenating.

In [9]:
addresses['fullAddress'] = addresses.Address+' '+addresses.CITY+' '+addresses.REGION+' '+addresses.POSTCODE.astype(str)

In [10]:
addresses

|  | id | Address | CITY | REGION | POSTCODE | fullAddress |
|---|---|---|---|---|---|---|
| 0 | 15850 | 1972 E 120 ST | CLEVELAND | OH | 44106 | 1972 E 120 ST CLEVELAND OH 44106 |
| 1 | 31822 | 2200 DELAWARE DR | CLEVELAND HEIGHTS | OH | 44106 | 2200 DELAWARE DR CLEVELAND HEIGHTS OH 44106 |
| 2 | 47687 | 1937 E 120 ST | CLEVELAND | OH | 44106 | 1937 E 120 ST CLEVELAND OH 44106 |
| 3 | 47688 | 1961 E 120 ST | CLEVELAND | OH | 44106 | 1961 E 120 ST CLEVELAND OH 44106 |
| 4 | 65312 | 1960 DENTON DR | CLEVELAND HEIGHTS | OH | 44106 | 1960 DENTON DR CLEVELAND HEIGHTS OH 44106 |
| ... | ... | ... | ... | ... | ... | ... |
| 94 | 172382 | 1531 E 118 ST | CLEVELAND | OH | 44106 | 1531 E 118 ST CLEVELAND OH 44106 |
| 95 | 172383 | 1527 E 118 ST | CLEVELAND | OH | 44106 | 1527 E 118 ST CLEVELAND OH 44106 |
| 96 | 172384 | 1515 E 118 ST | CLEVELAND | OH | 44106 | 1515 E 118 ST CLEVELAND OH 44106 |
| 97 | 172386 | 10900 EUCLID AVE | CLEVELAND | OH | 44106 | 10900 EUCLID AVE CLEVELAND OH 44106 |
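
Casting with `astype(str)` works here because the ZIP codes shown are all 44106. As a hedged aside, ZIP codes that begin with zero lose that zero once they are parsed as integers. The sketch below uses plain Python and a made-up value to show the problem and one possible repair; `format_zip` is a hypothetical helper, not part of pandas or GeoPandas.

```python
def format_zip(value):
    """Return a 5-character ZIP string, restoring leading zeros lost to int parsing."""
    return str(value).zfill(5)


# Hypothetical value: the ZIP code "02139" parsed as an integer becomes 2139.
parsed = int("02139")

print(str(parsed))         # the leading zero is gone
print(format_zip(parsed))  # padded back to five characters
print(format_zip(44106))   # five-digit values are unchanged
```

With pandas, an alternative is to keep the column as text from the start by passing `dtype={'POSTCODE': str}` to `pd.read_csv`.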


In [11]:
geocoded = gpd.tools.geocode(addresses['fullAddress'])

In [12]:
geocoded

|  | geometry | address |
|---|---|---|
| 0 | POINT (-81.60476 41.50893) | Museum of Contemporary Art Cleveland, Euclid A... |
| 1 | POINT (-81.59578 41.49795) | Delaware Drive, 44106, Cleveland Heights, Ohio... |
| 2 | POINT (-81.60476 41.50893) | Museum of Contemporary Art Cleveland, Euclid A... |
| 3 | POINT (-81.60476 41.50893) | Museum of Contemporary Art Cleveland, Euclid A... |
| 4 | POINT (-81.59900 41.49618) | Denton Drive, 44106, Cleveland Heights, Ohio, ... |
| ... | ... | ... |
| 94 | POINT (-81.60476 41.50893) | Museum of Contemporary Art Cleveland, Euclid A... |
| 95 | POINT (-81.60476 41.50893) | Museum of Contemporary Art Cleveland, Euclid A... |
| 96 | POINT (-81.60476 41.50893) | Museum of Contemporary Art Cleveland, Euclid A... |
| 97 | POINT (-81.60070 41.50139) | Case Western Reserve University, 10900, Euclid... |


The geocoded dataset is a GeoDataFrame with two columns, geometry and address.

In [14]:
type(geocoded)

geopandas.geodataframe.GeoDataFrame
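
Before moving on, a quick sanity check is worthwhile: in the rows shown above, several different street addresses were assigned the identical point, labelled as the Museum of Contemporary Art Cleveland. The sketch below, in plain Python with the visible coordinates typed in by hand, counts how many rows share each point. `repeated_points` is a hypothetical helper, and only the nine visible rows are included, so the count for the full dataset may differ.

```python
from collections import Counter

# (longitude, latitude) pairs copied from the visible rows of the output above,
# keyed by row index.
points = {
    0: (-81.60476, 41.50893),
    1: (-81.59578, 41.49795),
    2: (-81.60476, 41.50893),
    3: (-81.60476, 41.50893),
    4: (-81.59900, 41.49618),
    94: (-81.60476, 41.50893),
    95: (-81.60476, 41.50893),
    96: (-81.60476, 41.50893),
    97: (-81.60070, 41.50139),
}


def repeated_points(points_by_row, min_count=2):
    """Return {point: number of rows} for points shared by at least min_count rows."""
    counts = Counter(points_by_row.values())
    return {pt: n for pt, n in counts.items() if n >= min_count}


repeats = repeated_points(points)
for pt, n in repeats.items():
    print(pt, "is shared by", n, "rows")
```

Many distinct addresses collapsing onto one coordinate is a hint that the geocoder fell back to a coarser match, so such rows deserve a closer look before analysis.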

Now let's merge these columns back into the original dataset.

In [16]:
addresses = pd.concat([addresses,geocoded],axis=1)

In [17]:
type(addresses)

pandas.core.frame.DataFrame

Now we need to convert this to a GeoDataFrame. Since we already have a geometry column, this is relatively easy.

In [19]:
addressesGeo = gpd.GeoDataFrame(addresses,crs=geocoded.crs)

#### Trying a different geocoding provider

There are many geocoding services (free as well as proprietary) that we can use with GeoPandas. Let's try the Nominatim geocoder provided by OpenStreetMap (OSM). The `user_agent` argument identifies your application to the service.

In [21]:
geocodedNominatim = gpd.tools.geocode(addresses['fullAddress'],provider='nominatim', user_agent="test")

In [22]:
geocodedNominatim

|  | geometry | address |
|---|---|---|
| 0 | GEOMETRYCOLLECTION EMPTY |  |
| 1 | POINT (-81.59598 41.49932) | 2200, Delaware Drive, Cedar Fairmount, Clevela... |
| 2 | GEOMETRYCOLLECTION EMPTY |  |
| 3 | GEOMETRYCOLLECTION EMPTY |  |
| 4 | POINT (-81.60193 41.49635) | 1960, Denton Drive, Ambler Heights, Cleveland ... |
| ... | ... | ... |
| 94 | GEOMETRYCOLLECTION EMPTY |  |
| 95 | GEOMETRYCOLLECTION EMPTY |  |
| 96 | GEOMETRYCOLLECTION EMPTY |  |
| 97 | POINT (-81.60070 41.50139) | Case Western Reserve University, 10900, Euclid... |


As you can see, some rows have an empty geometry (`GEOMETRYCOLLECTION EMPTY`) and no matched address, indicating that the geocoding service was unable to geocode those addresses.

Commercial geocoding services from Google, Baidu, and Bing generally match more addresses, but they are not free (you pay per request).
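
One way to cope with unmatched addresses is to try a second geocoder only for the rows the first one missed. The sketch below illustrates the idea in plain Python. The two geocoder functions are stand-ins backed by small dictionaries (with coordinates copied from the outputs above), not real service clients; which stand-in "knows" which address is made up for illustration, and `geocode_with_fallback` is a hypothetical helper.

```python
def geocode_with_fallback(address, geocoders):
    """Try each (name, function) pair in order and return the first hit.

    Each function takes an address string and returns (lon, lat) or None.
    Returns (provider_name, point), or (None, None) if every provider fails.
    """
    for name, geocode in geocoders:
        point = geocode(address)
        if point is not None:
            return name, point
    return None, None


# Stand-in geocoders: dictionary lookups instead of web requests.
primary_table = {
    "2200 DELAWARE DR CLEVELAND HEIGHTS OH 44106": (-81.59598, 41.49932),
}
secondary_table = {
    "10900 EUCLID AVE CLEVELAND OH 44106": (-81.60070, 41.50139),
}
geocoders = [
    ("primary", primary_table.get),
    ("secondary", secondary_table.get),
]

sample_addresses = [
    "2200 DELAWARE DR CLEVELAND HEIGHTS OH 44106",
    "10900 EUCLID AVE CLEVELAND OH 44106",
    "1972 E 120 ST CLEVELAND OH 44106",
]

results = {a: geocode_with_fallback(a, geocoders) for a in sample_addresses}
matched = sum(1 for name, _ in results.values() if name is not None)
print(f"matched {matched} of {len(sample_addresses)} addresses")
```

Recording which provider produced each match, as done here, makes it easier to judge the quality of individual geocodes later.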

### Various levels of Geocode

A coordinate can be assigned to an address at different levels of precision, depending on the information the geocoder has.

1. Rooftop Geocodes

It is the most accurate geocode type and provides the exact location for the address.

![gcod](images/gcod10.png)

2. Parcel Centroid Geocodes

In this case the address is assigned to the centroid of the parcel boundary for the property. *"A 'parcel', 'lot', or 'tract' is a piece of land (or 'real property') with defined boundaries. "*. This is also relatively accurate. 

![gcod11](images/gcod11.png)

3. Interpolated Geocodes

Interpolation methods use information about address number ranges to estimate the position of a numbered address. For example, suppose there is no direct address match for 1149 38th St, Sacramento, California, 95816. If the geocoder has the geometry of the street segment (a line) and knows that the address range on that segment of 38th St runs from 1001 to 1299, it places 1149 proportionally along the line. Because 1149 falls almost exactly halfway through the range, the geocode lands close to the center point of the street line.

Is that accurate? It depends on the level of accuracy you need.

![gcod12](images/gcod12.png)

4. ZIP Code Geocodes

In this type of geocode, the address is assigned to the centroid of the ZIP code polygon that it falls in. For example, if the geocoder cannot geocode the address 2439 Overlook Rd, Cleveland, Ohio, 44106, it can assign the address to the centroid of the ZIP code polygon for 44106. This approach has very low accuracy, and users should be aware of such assignments.

![gcod13](images/gcod13.PNG)
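
The address-range interpolation described under Interpolated Geocodes can be sketched as a small calculation. The house-number range 1001 to 1299 comes from the example above; the segment endpoint coordinates are invented for illustration, and `interpolate_address` is a hypothetical helper that ignores details real geocoders handle, such as the odd and even sides of a street or offsets from the centerline.

```python
def interpolate_address(number, range_start, range_end, start_xy, end_xy):
    """Estimate a point along a street segment from a house number.

    The position is proportional to where `number` falls in the address range.
    Returns (fraction along the segment, (x, y)).
    """
    if not range_start <= number <= range_end:
        raise ValueError("house number is outside the segment's address range")
    if range_end == range_start:
        fraction = 0.5  # single-number range: fall back to the midpoint
    else:
        fraction = (number - range_start) / (range_end - range_start)
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return fraction, (x, y)


# Made-up endpoints (longitude, latitude) for the street segment.
segment_start = (-121.4600, 38.5700)
segment_end = (-121.4600, 38.5720)

fraction, point = interpolate_address(1149, 1001, 1299, segment_start, segment_end)
print(round(fraction, 3))  # just under 0.5, so close to the middle of the segment
print(point)
```

The estimate is only as good as the assumption that house numbers are evenly spaced along the segment, which is why interpolated geocodes are less accurate than rooftop or parcel centroid geocodes.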

### Geocoding Pitfalls and Remedies

As we have seen, geocoding accuracy varies. There is no single answer for how accurate geocoding is, because several factors must be considered first.

The level of geocoding accuracy you need depends on many factors:

1. If you want to identify clusters of a particular infectious disease and notify first responders, you need good geocoding accuracy. You also want to make sure that you don't lose valuable information by tossing away records that are not geocoded.

A problem with coarse geocodes such as ZIP code centroids is that they can generate "spurious clusters" during analysis. For example, if a geocoder matches many addresses to the centroid of ZIP code 44106, there will be a sudden increase of cases at that centroid, which raises a false alarm (such issues are very common in EHR databases).


2. Use of **remote geocoding web services can violate privacy rules for health data.**

If you use remote web services (which send requests over the internet) for geocoding, you are sending health data, which is supposed to be highly confidential, over the wire. The organization that provides the geocoding service can collect the addresses as well as other private information from such requests.

A workaround is to use a standalone geocoder, where the address database sits locally or at a secure research location. Standalone geocoders are available from ArcGIS (license required), or you can build your own geocoding service from a readily available data source such as the US Census TIGER/Line files and a database such as PostgreSQL.


3. Tossing records that don't have a geocode when data is at a premium.

If you are studying a new and complex disease that affects only a few people, you would not want to toss out records that don't have a geocode. Instead, you could try multiple geocoders (depending on your budget) to obtain one.


4. If you just want to aggregate the addresses at ZIP code level and the addresses already include a ZIP code, you don't even need to perform the relatively costly geocoding.

You can extract the ZIP code directly from the address and later merge it to a ZIP code polygon layer that has the ZIP code as an attribute (we will look into such table-based merges in upcoming sessions).
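
The ZIP-level shortcut in item 4 can be sketched without any geocoding at all. The example below pulls a five-digit ZIP code from the end of each address string with a regular expression and counts records per ZIP code. The addresses are taken from the dataset above (one has its ZIP code removed on purpose); `extract_zip` is a hypothetical helper, and the pattern assumes US-style addresses that end in a five-digit ZIP code, optionally followed by a ZIP+4 suffix.

```python
import re
from collections import Counter

# Five digits at the end of the string, optionally followed by "-1234".
ZIP_PATTERN = re.compile(r"(?<!\d)(\d{5})(?:-\d{4})?\s*$")


def extract_zip(address):
    """Return the trailing 5-digit ZIP code as a string, or None if absent."""
    match = ZIP_PATTERN.search(address)
    return match.group(1) if match else None


records = [
    "1972 E 120 ST CLEVELAND OH 44106",
    "2200 DELAWARE DR CLEVELAND HEIGHTS OH 44106",
    "10900 EUCLID AVE CLEVELAND OH 44106",
    "1960 DENTON DR CLEVELAND HEIGHTS OH",  # ZIP code removed to show a miss
]

# Count records per ZIP code; records without a ZIP code are counted under None.
counts = Counter(extract_zip(r) for r in records)
print(counts)
```

The resulting counts could later be joined to a ZIP code polygon layer on the ZIP code attribute.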
