In this notebook I take a look at geocoding the addresses in the IRS csvs and identify issues going forward.
The geocoder module I'm using is documented at https://geocoder.readthedocs.io/ and can be installed with pip.

In [1]:
import geocoder
import pandas as pd
import os

I'll load a sample of the csvs for this exploratory work: all 3 files from 2013 and all 3 files from 1989. All these files are stored in a folder called 'clean' in my working directory.

In [2]:
pc2013 = pd.read_csv('clean/2013pc_clean.csv')
pf2013 = pd.read_csv('clean/2013pf_clean.csv')
co2013 = pd.read_csv('clean/2013co_clean.csv')
pc1989 = pd.read_csv('clean/1989pc_clean.csv')
pf1989 = pd.read_csv('clean/1989pf_clean.csv')
co1989 = pd.read_csv('clean/1989co_clean.csv')

In [3]:
dfs = [pc2013, pf2013, co2013, pc1989, pf1989, co1989]

In [4]:
# Example usage of geocoder using osm backend

g = geocoder.osm('365 5TH AVE NEW YORK NY 10016')
g.latlng

[40.74851885, -73.9836392743124]

Let's take a look at some of the addresses in the csvs:

In [5]:
pc1989.ADDRESS[:15]

0      200 PARK AVE S 1116
1             17 E 47TH ST
2             462 BROADWAY
3             60 E 42ND ST
4             30 E END AVE
5              2 W 45TH ST
6               58 7TH AVE
7        128 PIERREPONT ST
8          200 EASTERN PKY
9         30 LAFAYETTE AVE
10      PO BOX 421 FDR STA
11    45 22 DOUGLASTON PKY
12          47 01 111TH ST
13          31 00 47TH AVE
14        87 05 CHELSEA ST
Name: ADDRESS, dtype: object

The geocoder wants full addresses, and ADDRESS only holds the street part. This will require creating a new column in each dataframe with the full address (ADDRESS, CITY, STATE, ZIP).

In [6]:
def full_address(df):
    # Adds a FULL_ADDRESS column to dataframe df by joining ADDRESS, CITY, STATE and ZIP.
    # The entry is None if any of the four values is missing (i.e. is not a string).
    parts = ['ADDRESS', 'CITY', 'STATE', 'ZIP']
    address = []
    # Iterate by position so this also works when the index labels are not 0..n-1
    for row in df[parts].itertuples(index=False):
        if all(isinstance(value, str) for value in row):
            address.append(" ".join(row))
        else:
            address.append(None)

    df['FULL_ADDRESS'] = address
    return df

In [7]:
for df in dfs:
    full_address(df)

In [8]:
def geocode_df(df):
    # Adds two new columns to dataframe df: LAT and LONG. Adds None if no address or if geocoding fails
    LAT = []
    LONG = []
    for address in df.FULL_ADDRESS:
        # Skip the request entirely when there is no address to look up
        g = geocoder.osm(address) if isinstance(address, str) else None
        # `not g.latlng` treats both None and an empty result as a failed lookup
        if g is None or not g.latlng:
            LAT.append(None)
            LONG.append(None)
        else:
            LAT.append(g.latlng[0])
            LONG.append(g.latlng[1])

    df['LAT'] = LAT
    df['LONG'] = LONG
    return df
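
geocode_df sends one request per row with no pause and no reuse of earlier results. Below is a minimal sketch of one way to throttle and cache lookups. The helper name make_throttled_geocoder and its parameters are my own invention, not part of geocoder, and the 1 second default is an assumption based on the public OpenStreetMap Nominatim usage policy (at most one request per second), which should be checked before running this on the full files.

```python
import time

def make_throttled_geocoder(lookup, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wrap `lookup` (address -> result) with a cache and a minimum delay between real calls.

    `lookup`, `min_interval`, `sleep` and `clock` are hypothetical parameters for this sketch.
    """
    cache = {}
    last_call = [None]  # time of the most recent real lookup

    def geocode(address):
        # Repeated addresses are answered from the cache without a new request
        if address in cache:
            return cache[address]
        if last_call[0] is not None:
            wait = min_interval - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        cache[address] = lookup(address)
        return cache[address]

    return geocode

# Stand-in lookup so the sketch runs without network access
calls = []
def fake_lookup(address):
    calls.append(address)
    return (40.0, -74.0)

geocode = make_throttled_geocoder(fake_lookup, min_interval=0.0)
geocode('365 5TH AVE NEW YORK NY 10016')
geocode('365 5TH AVE NEW YORK NY 10016')  # served from the cache
print(len(calls))  # prints 1
```

With the real backend, the lookup could be something like `lambda a: geocoder.osm(a).latlng`, and geocode_df would call the wrapped function instead of geocoder.osm directly.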

In [12]:
geocode_df(pf2013)

Unnamed: 0.1,Unnamed: 0,KeyID,EIN,DocCD,TAXPER,NTEEIRS,EOSTATUS,STYEAR,RECCODE,ASS_CODE,...,PMSA,LONGITUDE,LATITUDE,censusTract,block,NAICS,VerifyBy,FULL_ADDRESS,LAT,LONG
0,318,358,10733159,91,201312,A70,1.0,2013,Y,,...,5600.0,-74.004552,40.748714,,,710000.0,,525 W 24TH ST NEW YORK NY 10011-1104,40.744765,-73.995159
1,1253,1511,30420728,91,201310,A50,1.0,2012,Y,,...,5600.0,-73.988666,40.743272,,,712110.0,,410 PARK AVE STE 1710 NEW YORK NY 10022-9433,,
2,1273,1541,30453739,91,201403,B05,1.0,2013,Y,,...,5600.0,-73.823300,40.761300,,,511100.0,,PO BOX 604802 BAYSIDE NY 11360-0000,,
3,2720,3123,43649708,91,201312,A51,1.0,2013,Y,,...,5600.0,-73.968109,40.804705,,,712110.0,,375 RIVERSIDE DR APT 13B NEW YORK NY 10025-2149,,
4,2823,3244,43789287,91,201312,A20,1.0,2013,N,,...,5600.0,-73.975200,40.757700,,,710000.0,,666 FIFTH AVE 28 FL NEW YORK NY 10103-0001,,
5,3373,4319,46124122,91,201312,,1.0,2013,N,,...,5600.0,-73.982530,40.746000,,,813219.0,,475 PARK AVE S 31ST F NEW YORK NY 10016-0000,,
6,4959,6212,61532031,91,201312,A12,1.0,2013,N,,...,5600.0,-73.980615,40.764992,,,710000.0,,888 7TH AVE STE 1101 NEW YORK NY 10106-0206,,
7,5175,6449,61633557,91,201312,B820,1.0,2013,N,,...,5600.0,-73.942194,40.745860,,,711130.0,,GORDON OSTROWSKI 120 CLAREMONT AVE NEW YORK NY...,,
8,5208,6486,61655858,91,201406,A51,1.0,2013,Y,,...,5600.0,-73.975547,40.749760,,,712110.0,,622 THIRD AVENUE 33RD FLOOR NEW YORK NY 10017-...,,
9,5245,6527,61683335,91,201312,A60,1.0,2013,N,,...,5600.0,-73.987500,40.771500,,,711100.0,,165 WEST 66TH STREET NEW YORK NY 10023-6508,40.774762,-73.984074
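
Rows 0 and 9 are the only ones here with both the coordinates already in the csv (LATITUDE, LONGITUDE) and the ones returned by osm (LAT, LONG). A rough way to quantify how far apart they are is the great-circle distance. This is a sketch using the haversine formula with a mean Earth radius of 6371 km; the function name haversine_km is hypothetical, and the input values are copied from the output above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Row 0: (LATITUDE, LONGITUDE) from the csv vs. (LAT, LONG) from osm
row0_km = haversine_km(40.748714, -74.004552, 40.744765, -73.995159)
# Row 9: same comparison
row9_km = haversine_km(40.771500, -73.987500, 40.774762, -73.984074)
print(round(row0_km, 2), round(row9_km, 2))
```

Applied to a whole dataframe, this could become a new column that flags rows where the two sources disagree by more than some chosen threshold.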


Current issues:
- Figuring out number of requests possible per unit time (can we make it faster?)
- How to deal with apartment numbers/floor numbers/etc.
- Do we want information for post offices? Where do we want to map PO Boxes?
- Difference between the lat/long already in the csv (LATITUDE, LONGITUDE) and the lat/long returned by osm (LAT, LONG)
- What are we using this information for?
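
For the apartment/floor and PO box questions in the list above, one option is to clean the ADDRESS column before building FULL_ADDRESS. The following is only a sketch: the regular expressions and the function names is_po_box and strip_unit are hypothetical, tuned to the few addresses shown in this notebook, and I have not checked them against the full files or confirmed that they improve the osm hit rate.

```python
import re

# Hypothetical patterns based only on the sample rows in this notebook
PO_BOX = re.compile(r'\bP\.?\s*O\.?\s*BOX\b', re.IGNORECASE)
UNIT = re.compile(
    r'\s+(?:'
    r'(?:STE|SUITE|APT|RM|ROOM|UNIT)\s+\w+'       # e.g. "STE 1710", "APT 13B"
    r'|\d+(?:ST|ND|RD|TH)?\s+(?:FLOOR|FL|F)\b'    # e.g. "33RD FLOOR", "28 FL", "31ST F"
    r'|(?:FLOOR|FL)\s+\d+'                        # e.g. "FL 28"
    r')',
    re.IGNORECASE,
)

def is_po_box(street):
    """True if the street line looks like a PO box rather than a physical address."""
    return bool(PO_BOX.search(street))

def strip_unit(street):
    """Remove a suite/apartment/floor designator, leaving the building address."""
    return UNIT.sub('', street).strip()

samples = [
    '410 PARK AVE STE 1710',
    '375 RIVERSIDE DR APT 13B',
    '666 FIFTH AVE 28 FL',
    '475 PARK AVE S 31ST F',
    '622 THIRD AVENUE 33RD FLOOR',
    '165 WEST 66TH STREET',
    'PO BOX 421 FDR STA',
]
for street in samples:
    print(street, '->', 'PO BOX' if is_po_box(street) else strip_unit(street))
```

This does not handle a bare trailing unit number such as '200 PARK AVE S 1116' or a leading name such as 'GORDON OSTROWSKI 120 CLAREMONT AVE'. Those need a separate decision, as does where PO boxes should be mapped.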