# Missouri Sex Offender Registry - Failed Geocoding

Data acquisition, documentation, carpentry, geocoding, and database loading for Missouri Sex Offender Registry (MSOR) and supporting info.   
Here we will attempt to recover the MSOR entries that failed geocoding.

In [1]:
# IMPORTS
import geopandas as gpd
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import pyplot

import folium

from shapely.geometry import Point, Polygon

from geopy.geocoders import Nominatim # for geocoding

import random # for obscuring sex offender names

In [2]:
# we need GeoAlchemy2 to run the geodataframe to_postgis method later

In [3]:
# pip install GeoAlchemy2

In [4]:
pip install GeoAlchemy2==0.10.2

Collecting GeoAlchemy2==0.10.2
  Downloading https://files.pythonhosted.org/packages/df/b4/94b1f707dc89d107ac0a49a1f36a45b8b57812e603951f84bef999df3e3b/GeoAlchemy2-0.10.2-py2.py3-none-any.whl
Installing collected packages: GeoAlchemy2
Successfully installed GeoAlchemy2-0.10.2
Note: you may need to restart the kernel to use updated packages.


In [5]:
# a few more imports specific to the database process
import geoalchemy2 
import getpass

import psycopg2
import numpy
from psycopg2.extensions import adapt, register_adapter, AsIs

from sqlalchemy import create_engine


In [6]:
# get user password for connecting to the db
mypasswd = getpass.getpass()

········


In [32]:
# set up db connection
conn = psycopg2.connect(database = 'cappsds_psmd39', 
                              user = 'psmd39', 
                              host = 'pgsql.dsa.lan',
                              password = mypasswd)


In [33]:
# establish cursor and read the existing tables
cursor = conn.cursor()

cursor.execute("""SELECT relname FROM pg_class WHERE relkind='r'
                  AND relname !~ '^(pg_|sql_)';""") # "rel" is short for relation.

tables = [i[0] for i in cursor.fetchall()] # A list() of tables.
tables.sort()
tables


['country_borders',
 'gadm_admin_borders',
 'geonames_feature',
 'msorfailedgeocoding',
 'msorfailedgeocodingv2',
 'spatial_ref_sys',
 'stlchildcare',
 'stlnonrestrictedresidential',
 'stlnonrestrictedresparcels',
 'stlpubschools',
 'stlpvtschools',
 'stlresparcels',
 'stlrestrictedflat',
 'stlsexoffenders',
 'stlzoning']

### Get the entries that failed geocoding out of the database
In the prior notebook, we stored all these records in a dedicated table for easy access.

In [9]:
# query the table and read data into a dataframe
sql = "select * from msorfailedgeocoding;"
msor_nogeo = pd.read_sql_query(sql, conn)
print(msor_nogeo.shape)
msor_nogeo.head()

(2200, 24)


Unnamed: 0,index,randomid,name,address,city,st,zip,county,offense,offense_city,...,compliant,tier,date_of_birth,offense_date,conviction_date,confinement_release_date,probation/parole_release_date,offender_age_at_time_of_offense,full_address,geocode
0,5,47218,"ABBOTT, STEVEN R",5067 ENRIGHT APT 1C,ST LOUIS,MO,63108,ST LOUIS CITY,CHILD MOLEST-1ST DEGREE,DUNKLIN,...,Y,3,1982-08-29,2010-01-23,2012-02-23,2019-07-30,2023-06-23,27,"5067 ENRIGHT APT 1C,ST LOUIS,MO",
1,6,1021,"ABBOTT, STEVEN R",5067 ENRIGHT APT 1C,ST LOUIS,MO,63108,ST LOUIS CITY,SEXUAL MISCONDUCT-1ST,KENNETT,...,Y,3,1982-08-29,2002-02-19,2002-06-21,NaT,NaT,19,"5067 ENRIGHT APT 1C,ST LOUIS,MO",
2,7,11969,"ABBOTT, STEVEN R",5067 ENRIGHT APT 1C,ST LOUIS,MO,63108,ST LOUIS CITY,STAT SODOMY-1ST DEG-PERS UND 14,DUNKLIN,...,Y,3,1982-08-29,2010-01-23,2012-02-23,2019-07-30,NaT,27,"5067 ENRIGHT APT 1C,ST LOUIS,MO",
3,12,38581,"ABDI, IBRAHIM A",3764 CHIPPEWA ST APT 8,SAINT LOUIS,MO,63116,ST LOUIS CITY,SEXUAL MISCONDUCT-3RD,ST LOUIS,...,Y,1,1981-09-08,2004-11-14,2006-03-06,NaT,NaT,23,"3764 CHIPPEWA ST APT 8,SAINT LOUIS,MO",
4,23,40681,"ABERNATHY, RANDELL L",3866 S SPRING AVE APT 1S,SAINT LOUIS,MO,63116,ST LOUIS CITY,AGG CRIM SEX ASSAULT,LEXINGTON,...,Y,3,1969-07-30,1993-11-01,1994-01-24,NaT,NaT,24,"3866 S SPRING AVE APT 1S,SAINT LOUIS,MO",


In [10]:
# examine some of the entries that we need to fix 
# we are looking for common trends and features that could be causing problems with the geocoder
msor_nogeo_unique = msor_nogeo[['randomid','address','geocode']]
msor_nogeo_unique.drop_duplicates(subset=['address','geocode']).head(10)

Unnamed: 0,randomid,address,geocode
0,47218,5067 ENRIGHT APT 1C,
3,38581,3764 CHIPPEWA ST APT 8,
4,40681,3866 S SPRING AVE APT 1S,
6,45170,3329 LAWN AVE APT 4,
8,8721,4133 CLEVELAND AVE APT 1W,
9,11576,5340 GRANT ST FL 2ND,
10,10749,3764 CHIPPEWA ST APT 12,
11,26411,9756 LILAC DR APT C,
13,41206,120 W CATALAN AVE APT 201,
17,42918,1080 ROTH AVE APT A,


### Set up the geocoder

In [11]:
# set up the geocoder
geolocator = Nominatim(timeout=10, user_agent = "myGeolocator")

In [12]:
# test out the geocoder with a single address
location = geolocator.geocode('120 CATALAN,ST LOUIS,MO')
print(location)
print((location.latitude, location.longitude))

St. Louis Skatium, 120, East Catalan Street, Patch, Saint Louis, Missouri, 63111, United States
(38.5396446, -90.26550765004728)


### Remove some of the substrings that cause the geocoder to fail

One common trend that we can see from exploring the `address` column is that a large portion of the addresses carry suffix info at the sub-location level, e.g. apartment numbers, floors, and unit designations. Let's take these out and see if that improves our performance.

In [13]:
# set up a list containing the string elements we want to remove
to_remove = [' FL ',' APT',' NBR',' RM',' UNIT',' DEPT',' REAR']
tot_ct = 0 # initialize a counter

# copy the existing addresses to a new column to initialize the target of the 'for' loop
msor_nogeo['new_address'] = msor_nogeo['address']
print("Dataframe has",len(msor_nogeo),"entries")

# loop through all the elements in the list, removing each element from the address
for i in to_remove:
    # split() returns a list: the part of the address before the match string [i] and the part after
    #    (two elements when the string occurs once); we only care about the part before
    msor_nogeo['split'] = msor_nogeo['new_address'].str.split(i)
    # convert those list items into two columns in a new (temp) df
    # then store the usable column back in the original df
    address_split = pd.DataFrame(msor_nogeo['split'].to_list(), columns=['keep', 'trash'])
    # count how many items we modified by counting the pieces we threw away
    loop_ct = address_split['trash'].notnull().sum()
    # keep a running total of the modifications we've made
    tot_ct = tot_ct + loop_ct
    # overwrite the "new_address" with the updated value. this can then be used in subsequent loops for new matches.
    msor_nogeo['new_address'] = address_split['keep']
    print('Removed "',i,'" from address',' (',loop_ct,' entries)',sep='')

# drop the split column since we don't need it anymore
msor_nogeo.drop('split', axis=1, inplace=True)

print(tot_ct,'total modifications')

msor_nogeo[['zip','address','full_address','new_address']].head()


Dataframe has 2200 entries
Removed " FL " from address (254 entries)
Removed " APT" from address (1655 entries)
Removed " NBR" from address (4 entries)
Removed " RM" from address (86 entries)
Removed " UNIT" from address (30 entries)
Removed " DEPT" from address (1 entries)
Removed " REAR" from address (2 entries)
2032 total modifications


Unnamed: 0,zip,address,full_address,new_address
0,63108,5067 ENRIGHT APT 1C,"5067 ENRIGHT APT 1C,ST LOUIS,MO",5067 ENRIGHT
1,63108,5067 ENRIGHT APT 1C,"5067 ENRIGHT APT 1C,ST LOUIS,MO",5067 ENRIGHT
2,63108,5067 ENRIGHT APT 1C,"5067 ENRIGHT APT 1C,ST LOUIS,MO",5067 ENRIGHT
3,63116,3764 CHIPPEWA ST APT 8,"3764 CHIPPEWA ST APT 8,SAINT LOUIS,MO",3764 CHIPPEWA ST
4,63116,3866 S SPRING AVE APT 1S,"3866 S SPRING AVE APT 1S,SAINT LOUIS,MO",3866 S SPRING AVE
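
The loop above assumes every address splits into at most two pieces, since `columns=['keep', 'trash']` only has room for two. As a hedged alternative, here is a minimal sketch that truncates an address at the first unit designator with a single regular expression. The helper name `strip_unit_info` is hypothetical, and the token list simply mirrors `to_remove`.

```python
import re

# Same tokens as `to_remove`; " FL " keeps its trailing space so that
# street names starting with "FL" are left alone.
UNIT_PATTERN = re.compile(r" (?:FL |APT|NBR|RM|UNIT|DEPT|REAR).*$")

def strip_unit_info(address):
    """Drop everything from the first unit designator onward."""
    return UNIT_PATTERN.sub("", address)

print(strip_unit_info("5067 ENRIGHT APT 1C"))
print(strip_unit_info("5340 GRANT ST FL 2ND"))
print(strip_unit_info("2631 SALEM RD"))
```

On the dataframe, this could be applied with `msor_nogeo['address'].apply(strip_unit_info)`.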


In [14]:
# work up new addresses that are geocoder-compatible
msor_nogeo['new_address'] = msor_nogeo['new_address'] + "," + msor_nogeo['city'] + "," + msor_nogeo['st']


In [15]:
# count no geocodes (isnull=='True') BEFORE sending to geocoder
msor_nogeo['geocode'].isnull().value_counts()

True    2200
Name: geocode, dtype: int64

In [16]:
# send the updated addresses back to the geocoder
msor_nogeo['geocode'] = msor_nogeo.new_address.apply(geolocator.geocode)

In [17]:
# count no geocodes AFTER sending to geocoder
msor_nogeo['geocode'].isnull().value_counts()

False    2007
True      193
Name: geocode, dtype: int64
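
Many rows share the same address (the first three rows of the table are identical, for example), so sending every row to the geocoder repeats work. Below is a minimal sketch, under stated assumptions, that looks up each distinct address once and pauses between requests; the public Nominatim service asks for no more than one request per second. The function name `geocode_unique` and the `delay_seconds` parameter are hypothetical. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that could handle the pause instead.

```python
import time

def geocode_unique(addresses, geocode, delay_seconds=1.0):
    """Geocode each distinct address once and reuse the result for repeats.

    `geocode` is any callable that takes an address string and returns a
    location (or None), e.g. `geolocator.geocode`.
    """
    cache = {}
    for address in addresses:
        if address in cache:
            continue  # already looked up, no new request needed
        if cache:
            time.sleep(delay_seconds)  # pause between real requests
        cache[address] = geocode(address)
    # return results in the same order as the input, one per row
    return [cache[address] for address in addresses]

# stand-in geocoder so the sketch runs without network access
def fake_geocode(address):
    return address.lower()

print(geocode_unique(["A ST", "B ST", "A ST"], fake_geocode, delay_seconds=0))
```

With the real geocoder this might look like `msor_nogeo['geocode'] = geocode_unique(msor_nogeo['new_address'].tolist(), geolocator.geocode)`.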

### How are we doing?
By looking at the number of null/"None" elements in the 'geocode' column before and after running our updated addresses through the geocoder, we can see a significant improvement in our outcome. We've fixed over 90% of the failed entries (2007 of 2200)!
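
The "over 90%" figure comes straight from the two `value_counts()` outputs. A small sketch of the arithmetic, with a hypothetical helper name `recovery_rate`:

```python
def recovery_rate(total, still_null):
    """Share of previously failed entries that now have a geocode."""
    return (total - still_null) / total

# 2200 entries went in, 193 still have no geocode
print(f"{recovery_rate(2200, 193):.1%}")
```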

Let's visualize the results so far, then save this work by pushing the now-successful entries into the PostGIS database.

In [25]:
# downselect 'msor_nogeo' to only the rows that now have geocodes
# remove rows that do not have location data
msor_nogeo_fixed = msor_nogeo[msor_nogeo['geocode'].notna()].copy()

# find all rows where the geocode still did not populate
# save them in a new df so we can examine them later
msor_nogeo_v2 = msor_nogeo[msor_nogeo['geocode'].isna()].copy()


In [26]:
msor_nogeo_fixed.shape

(2007, 25)

In [27]:
# set up the dataframe to visualize the results
# get the latitude and longitude values from the 'geocode' column and put them in their own columns for easier plotting
msor_nogeo_fixed['lat'] = [g.latitude for g in msor_nogeo_fixed.geocode]
msor_nogeo_fixed['long'] = [g.longitude for g in msor_nogeo_fixed.geocode]
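
A geocoder can return a match far from the intended city when an address is ambiguous, so a quick sanity check on the coordinates may be worthwhile before mapping. The sketch below is only an illustration: the bounding box values are a rough assumption for the greater St. Louis area, not an official boundary, and the helper name `in_stl_area` is hypothetical.

```python
# Rough, assumed bounding box around the St. Louis area (degrees)
LAT_MIN, LAT_MAX = 38.3, 39.0
LONG_MIN, LONG_MAX = -91.0, -90.0

def in_stl_area(lat, long):
    """True when a point falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LONG_MIN <= long <= LONG_MAX

print(in_stl_area(38.653205, -90.265823))  # a recovered entry from above
print(in_stl_area(40.7128, -74.0060))      # a point well outside Missouri
```

Rows that fail the check could be set aside for manual review rather than loaded into the database.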


#### Render a map that shows all the entries we recovered!

In [28]:
# create a base map centered on St. Louis
map_sexoffenders = folium.Map(
    location=[38.627003, -90.3],
    tiles='cartodbpositron',
    zoom_start=11,
)

# add a marker for each recovered registry entry
# label each marker with the offense
for i in range(0,len(msor_nogeo_fixed)):
   folium.Marker(
      location=[msor_nogeo_fixed.iloc[i]['lat'], msor_nogeo_fixed.iloc[i]['long']],
      popup=msor_nogeo_fixed.iloc[i]['offense']
   ).add_to(map_sexoffenders)

# display the map
map_sexoffenders

#### Append these results to the existing `stlsexoffenders` table

In [34]:
# check the size of the db table before we make any additions
sql = "select count(*) from stlsexoffenders;"
msor_table_ct_before = pd.read_sql_query(sql, conn)
msor_table_ct_before

Unnamed: 0,count
0,3613


In [35]:
# check out the form of the existing 'stlsexoffenders' table 
# we will need to match this structure in order to successfully append our new entries

# query the table and read data into a df 
sql = "select * from stlsexoffenders LIMIT 10;"
msor_table_sample = pd.read_sql_query(sql, conn)
print(msor_table_sample.shape)
print(msor_table_sample.dtypes)
msor_table_sample.head(2)


(10, 24)
randomid                             int64
name                                object
address                             object
city                                object
st                                  object
zip                                  int64
county                              object
offense                             object
offense_city                        object
offense_state                       object
victim_gender                       object
victim_age                           int64
victim_max_age                      object
compliant                           object
tier                                 int64
date_of_birth                       object
offense_date                        object
conviction_date                     object
confinement_release_date            object
probation/parole_release_date       object
offender_age_at_time_of_offense      int64
lat                                float64
long                               float64
ge

Unnamed: 0,randomid,name,address,city,st,zip,county,offense,offense_city,offense_state,...,tier,date_of_birth,offense_date,conviction_date,confinement_release_date,probation/parole_release_date,offender_age_at_time_of_offense,lat,long,geometry
0,118,"ABERNATHY, STEVIE A",133 BAYVIEW DR,SAINT LOUIS,MO,63135,ST LOUIS,STATUTORY RAPE-2ND DEGRE,ST PETERS,MO,...,3,1991-11-11,2014-01-26,2015-12-18,2019-08-23,2023-08-09 00:00:00,22,38.744944,-90.290619,0101000020E61000007078107F999256C0BF66D6525A5F...
1,1904,"ABRAMS, NORVELL L",1946 HEBERT ST,SAINT LOUIS,MO,63107,ST LOUIS CITY,ATTEMPT SEXUAL ABUSE,PAGEDALE,MO,...,2,1973-05-08,2002-10-14,2003-02-21,2006-10-14,2006-10-14 00:00:00,29,38.654602,-90.201029,0101000020E6100000B4D0DBA8DD8C56C0064ED0FFC953...


In [36]:
# compare this to the same info from the df we've been working on
print(msor_nogeo_fixed.shape)
print(msor_nogeo_fixed.dtypes)
msor_nogeo_fixed.head(2)

(2007, 27)
index                                       int64
randomid                                    int64
name                                       object
address                                    object
city                                       object
st                                         object
zip                                         int64
county                                     object
offense                                    object
offense_city                               object
offense_state                              object
victim_gender                              object
victim_age                                  int64
victim_max_age                             object
compliant                                  object
tier                                        int64
date_of_birth                      datetime64[ns]
offense_date                       datetime64[ns]
conviction_date                    datetime64[ns]
confinement_release_date           date

Unnamed: 0,index,randomid,name,address,city,st,zip,county,offense,offense_city,...,offense_date,conviction_date,confinement_release_date,probation/parole_release_date,offender_age_at_time_of_offense,full_address,geocode,new_address,lat,long
0,5,47218,"ABBOTT, STEVEN R",5067 ENRIGHT APT 1C,ST LOUIS,MO,63108,ST LOUIS CITY,CHILD MOLEST-1ST DEGREE,DUNKLIN,...,2010-01-23,2012-02-23,2019-07-30,2023-06-23,27,"5067 ENRIGHT APT 1C,ST LOUIS,MO","(5067, Enright Avenue, Academy, Cabanne Place,...","5067 ENRIGHT,ST LOUIS,MO",38.653205,-90.265823
1,6,1021,"ABBOTT, STEVEN R",5067 ENRIGHT APT 1C,ST LOUIS,MO,63108,ST LOUIS CITY,SEXUAL MISCONDUCT-1ST,KENNETT,...,2002-02-19,2002-06-21,NaT,NaT,19,"5067 ENRIGHT APT 1C,ST LOUIS,MO","(5067, Enright Avenue, Academy, Cabanne Place,...","5067 ENRIGHT,ST LOUIS,MO",38.653205,-90.265823


Comparing the outputs above, we need to...  
1. **Remove** index, full_address, geocode, new_address
2. **Convert** our dataframe into a _geo_dataframe
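
The list of columns to remove can also be derived rather than read off by eye. Here is a minimal sketch using abbreviated column lists; the helper name `columns_to_drop` is hypothetical.

```python
def columns_to_drop(df_columns, table_columns):
    """Columns present in the dataframe but absent from the target table."""
    table = set(table_columns)
    return [col for col in df_columns if col not in table]

# abbreviated column lists, for illustration only
df_columns = ["index", "randomid", "name", "address", "full_address",
              "geocode", "new_address", "lat", "long"]
table_columns = ["randomid", "name", "address", "lat", "long", "geometry"]

print(columns_to_drop(df_columns, table_columns))
```

With the real objects this could be called as `columns_to_drop(msor_nogeo_fixed.columns, msor_table_sample.columns)`. The `geometry` column shows up only on the table side, which is what step 2 addresses.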

In [38]:
# before we start dropping columns, copy the dataframe just in case
msor_db = msor_nogeo_fixed.copy()

In [39]:
# 1. drop columns that we don't need
msor_db.drop(['index','full_address','geocode','new_address'], inplace=True, axis=1)

In [40]:
# 2. convert dataframe into a geodataframe in order for it to work correctly with PostGIS

# create the 'geometry' column for the geodataframe
geometry = [Point(xy) for xy in zip(msor_db['long'], msor_db['lat'])]
# generate the geodataframe using the msor df + the geometry info
# set the CRS (in degrees) as part of this process
msor_db = gpd.GeoDataFrame(msor_db, geometry = geometry, crs=4326) 


In [41]:
# load the data!

# Set up database connection engine
# FORMAT: engine = create_engine('postgresql://user:password@host:5432/')
engine = create_engine(f'postgresql://psmd39:{mypasswd}@pgsql.dsa.lan:5432/cappsds_psmd39', echo=False)

# GeoDataFrame to PostGIS
msor_db.to_postgis(
    con=engine,
    name="stlsexoffenders",
    if_exists='append' # note that we are APPENDING this new info to the existing table
)
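
One caveat with building the connection string from an f-string: a password containing characters such as `@`, `/`, or `:` would break the URL unless it is escaped first. A minimal sketch using the standard library's `quote_plus` follows; the helper name `build_pg_url` and the sample password are hypothetical.

```python
from urllib.parse import quote_plus

def build_pg_url(user, password, host, port, database):
    """Build a PostgreSQL URL with the user and password percent-escaped."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )

# made-up password containing characters that need escaping
print(build_pg_url("psmd39", "p@ss/w:rd", "pgsql.dsa.lan", 5432, "cappsds_psmd39"))
```

The result could then be passed to `create_engine(...)` in place of the f-string.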


In [29]:
engine.dispose() 

**Our database table now contains the original entries plus the newly-fixed entries.**  
However, we still have a few items that have not been successfully geocoded. Let's store those in a NEW table so we can work on them later without running through this entire process again.

In [30]:
# Set up database connection engine
engine = create_engine(f'postgresql://psmd39:{mypasswd}@pgsql.dsa.lan:5432/cappsds_psmd39', echo=False)

# DataFrame to PostgreSQL
msor_nogeo_v2.to_sql(
    con=engine,
    name="msorfailedgeocodingv2",
    if_exists='replace'
)

In [42]:
# check the size of the db table
sql = "select count(*) from stlsexoffenders;"
msor_table_ct_after = pd.read_sql_query(sql, conn)
msor_table_ct_after

Unnamed: 0,count
0,5620


In [31]:
engine.dispose() 

## Failed geocoding redux
Can we get the number of geocoding failures even closer to zero? Potential trouble areas in the remaining data:

- Cardinal direction letters e.g. "N","S","E","W"
- Road suffixes e.g. "AVE","RD","BLVD"

**NOTE:** We are doing all these in a second pass (vs. rolling into the above work) because making too many changes at once has an adverse effect on many entries. That is, removing things like " FL" (up above) was enough to get those items to geocode, but removing *more* info like "AVE FL" could cause those same items to fail. Thus, this iterative approach is needed.
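
The iterative idea can be sketched as a per-address fallback: try the address as-is, and only simplify further while the geocoder keeps returning nothing. This is a hedged illustration, not the approach used in this notebook; the names `geocode_with_fallbacks`, `drop_ave`, and `drop_west` are hypothetical, and the stand-in geocoder only exists so the example runs offline.

```python
def geocode_with_fallbacks(address, geocode, simplifiers):
    """Apply simplifiers one at a time until the geocoder returns a match.

    Returns the address variant that was tried last and its result
    (None if nothing matched).
    """
    candidate = address
    result = geocode(candidate)
    for simplify in simplifiers:
        if result is not None:
            break  # stop as soon as something geocodes
        candidate = simplify(candidate)
        result = geocode(candidate)
    return candidate, result

# stand-in geocoder that only knows one address
def fake_geocode(address):
    known = {"120 CATALAN,ST LOUIS,MO": (38.5396446, -90.26550765004728)}
    return known.get(address)

def drop_ave(address):
    return address.replace(" AVE,", ",")

def drop_west(address):
    return address.replace(" W ", " ")

print(geocode_with_fallbacks("120 W CATALAN AVE,ST LOUIS,MO",
                             fake_geocode, [drop_ave, drop_west]))
```

Dropping a cardinal direction can shift a match to a different block (the test lookup for "120 CATALAN" earlier came back on East Catalan Street), so results that needed the later steps deserve a spot check.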

In [43]:
# query the table and read data into a df 
sql = "select * from msorfailedgeocodingv2;"
msor_nogeo_redux = pd.read_sql_query(sql, conn)
print(msor_nogeo_redux.shape)


(193, 26)


In [44]:
# take a look at the different address info we've worked up in the past
msor_nogeo_redux[['zip','address','full_address','new_address']].head()


Unnamed: 0,zip,address,full_address,new_address
0,63111,120 W CATALAN AVE APT 201,"120 W CATALAN AVE APT 201,ST LOUIS,MO","120 W CATALAN AVE,ST LOUIS,MO"
1,63111,120 W CATALAN AVE APT 201,"120 W CATALAN AVE APT 201,ST LOUIS,MO","120 W CATALAN AVE,ST LOUIS,MO"
2,63111,120 W CATALAN AVE APT 201,"120 W CATALAN AVE APT 201,ST LOUIS,MO","120 W CATALAN AVE,ST LOUIS,MO"
3,63111,120 W CATALAN AVE APT 201,"120 W CATALAN AVE APT 201,ST LOUIS,MO","120 W CATALAN AVE,ST LOUIS,MO"
4,63144,2631 SALEM RD,"2631 SALEM RD,SAINT LOUIS,MO","2631 SALEM RD,SAINT LOUIS,MO"


In [45]:
# we will be working primarily off of the 'new_address' column in order to benefit from the earlier modifications
# - again, we're iterating here

# remove some of the elements that trip up the geocoder, anchoring to the comma in the new_address column
# this helps us avoid unwanted replacements elsewhere
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' RD,',',')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' AVE,',',')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' DR,',',')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' ST,',',')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' BLVD,',',')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' LN,',',')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' TRAK,',',')

# remove all cardinal directions
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' N ',' ')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' S ',' ')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' E ',' ')
msor_nogeo_redux['new_address'] = msor_nogeo_redux['new_address'].str.replace(' W ',' ')


In [46]:
# count no geocodes (isnull=='True') BEFORE sending to geocoder
msor_nogeo_redux['geocode'].isnull().value_counts()

True    193
Name: geocode, dtype: int64

In [47]:
# send the updated addresses back to the geocoder
msor_nogeo_redux['geocode'] = msor_nogeo_redux.new_address.apply(geolocator.geocode)

In [48]:
# count no geocodes AFTER sending to geocoder
msor_nogeo_redux['geocode'].isnull().value_counts()

False    107
True      86
Name: geocode, dtype: int64

We were able to recover over half of the remaining items! Let's get them into the database and wrap up this work.

In [49]:
# downselect 'msor_nogeo_redux' to only the rows that now have geocodes
# remove rows that do not have location data
msor_nogeo_fixed_redux = msor_nogeo_redux[msor_nogeo_redux['geocode'].notna()].copy()

# find all rows where the geocode still did not populate
# save them in a new df so we can examine them later
msor_nogeo_v3 = msor_nogeo_redux[msor_nogeo_redux['geocode'].isna()].copy()


In [52]:
# set up the dataframe to visualize the results
# get the latitude and longitude values from the 'geocode' column and put them in their own columns for easier plotting
msor_nogeo_fixed_redux['lat'] = [g.latitude for g in msor_nogeo_fixed_redux.geocode]
msor_nogeo_fixed_redux['long'] = [g.longitude for g in msor_nogeo_fixed_redux.geocode]


#### Render a map that shows all the new entries we recovered!

In [53]:
# create a base map centered on St. Louis
map_sexoffenders_redux = folium.Map(
    location=[38.627003, -90.3],
    tiles='cartodbpositron',
    zoom_start=11,
)

# add a marker for each recovered registry entry
# label each marker with the offense
for i in range(0,len(msor_nogeo_fixed_redux)):
   folium.Marker(
      location=[msor_nogeo_fixed_redux.iloc[i]['lat'], msor_nogeo_fixed_redux.iloc[i]['long']],
      popup=msor_nogeo_fixed_redux.iloc[i]['offense']
   ).add_to(map_sexoffenders_redux)

# display the map
map_sexoffenders_redux

#### Append these results to the existing table

In [54]:
# compare this to the same info from the df we've been working on
print(msor_nogeo_fixed_redux.shape)
msor_nogeo_fixed_redux.head(2)

(107, 28)


Unnamed: 0,level_0,index,randomid,name,address,city,st,zip,county,offense,...,offense_date,conviction_date,confinement_release_date,probation/parole_release_date,offender_age_at_time_of_offense,full_address,geocode,new_address,lat,long
0,13,283,41206,"ALDRIDGE, SAMUEL A",120 W CATALAN AVE APT 201,ST LOUIS,MO,63111,ST LOUIS CITY,ATTEMPT RAPE,...,1985-03-31,1985-10-02,NaT,NaT,20,"120 W CATALAN AVE APT 201,ST LOUIS,MO","(St. Louis Skatium, 120, East Catalan Street, ...","120 CATALAN,ST LOUIS,MO",38.539645,-90.265508
1,14,284,54884,"ALDRIDGE, SAMUEL A",120 W CATALAN AVE APT 201,ST LOUIS,MO,63111,ST LOUIS CITY,CHLD MOLST-2ND DEG-INJRY,...,2008-02-22,2008-11-26,2010-10-21,NaT,43,"120 W CATALAN AVE APT 201,ST LOUIS,MO","(St. Louis Skatium, 120, East Catalan Street, ...","120 CATALAN,ST LOUIS,MO",38.539645,-90.265508


Comparing the outputs above, we need to...  
1. **Remove** level_0, index, full_address, geocode, new_address
2. **Convert** our dataframe into a _geo_dataframe

In [58]:
# before we start dropping columns, copy the dataframe just in case
msor_db = msor_nogeo_fixed_redux.copy()


In [59]:
# 1. drop columns that we don't need
msor_db.drop(['level_0','index','full_address','geocode','new_address'], inplace=True, axis=1)

In [60]:
# 2. convert dataframe into a geodataframe in order for it to work correctly with PostGIS

# create the 'geometry' column for the geodataframe
geometry = [Point(xy) for xy in zip(msor_db['long'], msor_db['lat'])]
# generate the geodataframe using the msor df + the geometry info
# set the CRS (in degrees) as part of this process
msor_db = gpd.GeoDataFrame(msor_db, geometry = geometry, crs=4326) 


In [61]:
# load the data!

# Set up database connection engine
# FORMAT: engine = create_engine('postgresql://user:password@host:5432/')
engine = create_engine(f'postgresql://psmd39:{mypasswd}@pgsql.dsa.lan:5432/cappsds_psmd39', echo=False)

# GeoDataFrame to PostGIS
msor_db.to_postgis(
    con=engine,
    name="stlsexoffenders",
    if_exists='append' # note that we are APPENDING this new info to the existing table
)


In [62]:
# check the size of the db table
sql = "select count(*) from stlsexoffenders;"
msor_redux_table_ct_after = pd.read_sql_query(sql, conn)
msor_redux_table_ct_after

Unnamed: 0,count
0,5727


In [63]:
#close connection to the db
conn.close()
engine.dispose()

# Summary

We've successfully added as many of the sex offender locations to our PostGIS database as is reasonable. The remaining few (86 out of approximately 5,800) should not have a meaningful impact on our analysis. Some of these remaining items, such as "complaint/pending registration" and "homeless", aren't actual addresses and thus will never geocode. It's time to move on to additional work.
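
If we ever revisit the leftovers, entries that are not street addresses could be flagged up front so they are never sent to the geocoder. The sketch below uses a simple heuristic that is purely an assumption (a street address starts with a house number); the helper name `looks_like_street_address` is hypothetical.

```python
import re

def looks_like_street_address(address):
    """Heuristic: a street address starts with a house number."""
    return bool(re.match(r"\s*\d+\s+\S", address))

print(looks_like_street_address("2631 SALEM RD"))
print(looks_like_street_address("HOMELESS"))
```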