**Evictions Dataset Description**

This dataset lists pending, scheduled, and executed evictions within the five boroughs
of New York City from 2017 through early 2022. The data can be sorted by Court Index Number,
Docket Number, Eviction Address, Apartment Number, Executed Date, Marshal First
Name, Marshal Last Name, Residential or Commercial (property type), Borough, Zip
Code, and Scheduled Status (Pending/Scheduled).

1. Import pandas  

In [None]:
import pandas as pd

2. Read CSV into DataFrame

In [None]:
evictions_df = pd.read_csv('Evictions.csv')
evictions_df

Unnamed: 0,Court Index Number,Docket Number,Eviction Address,Eviction Apartment Number,Executed Date,Marshal First Name,Marshal Last Name,Residential/Commercial,BOROUGH,Eviction Postcode,Ejectment,Eviction/Legal Possession,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
0,50365/19,352435,319 WEST 94TH STREET,C103,03/25/2019,Thomas,Bia,Residential,MANHATTAN,10025,Not an Ejectment,Possession,40.794205,-73.974734,7.0,6.0,183.0,1034178.0,1.012530e+09,Upper West Side
1,B068159/16,379048,2332 CRESTON AVE,41,04/25/2017,Richard,McCoy,Residential,BRONX,10468,Not an Ejectment,Possession,40.858643,-73.900402,5.0,14.0,23703.0,2013777.0,2.031640e+09,Fordham South
2,59891/16,320691,2670 BAINBRIDGE AVENUE,2F,02/21/2017,John,Villanueva,Residential,BRONX,10458,Not an Ejectment,Possession,40.865470,-73.891472,7.0,15.0,40502.0,2016620.0,2.032870e+09,Bedford Park-Fordham North
3,75708/18,115775,18-24 25TH ROAD,1,04/23/2019,Maxine,Chevlowe,Residential,QUEENS,11102,Not an Ejectment,Possession,40.774861,-73.926140,1.0,22.0,91.0,4019956.0,4.008870e+09,Old Astoria
4,900940/18,86395,3005 EASTCHESTER RO AD,STOREFRONT,11/08/2018,Justin,Grossman,Commercial,BRONX,10469,Not an Ejectment,Possession,40.869607,-73.842766,12.0,12.0,358.0,2061802.0,2.047620e+09,Eastchester-Edenwald-Baychester
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67723,R 50452/17,68450,182 ARLINGTON AVENUE,,03/29/2017,Steven,Powell,Residential,STATEN ISLAND,10303,Not an Ejectment,Possession,40.635485,-74.167531,1.0,49.0,323.0,5028488.0,5.012670e+09,Mariner's Harbor-Arlington-Port Ivory-Granitev...
67724,N66489/17,81749,600 WEST 136TH STREE T,6F,09/27/2017,Ileana,Rivera,Residential,MANHATTAN,10031,Not an Ejectment,Possession,40.820935,-73.954996,9.0,7.0,22301.0,1059959.0,1.020020e+09,Manhattanville
67725,60675/17-1,208636,115 LINCOLN ROAD,3N,01/09/2020,Richard,Capuano,Residential,BROOKLYN,11225,Not an Ejectment,Possession,40.661077,-73.958799,9.0,40.0,79801.0,3379187.0,3.013270e+09,Prospect Lefferts Gardens-Wingate
67726,B42335/17,83318,1131-1133 OGDEN AVEN UE,12A,12/13/2017,Ileana,Rivera,Residential,BRONX,10452,Not an Ejectment,Possession,40.836286,-73.927496,4.0,16.0,199.0,2088158.0,2.025260e+09,Highbridge


3. Explore the data and its structure

In [None]:
# .shape shows the number of rows and columns
evictions_df.shape

(67728, 20)

In [None]:
# .dtypes shows the data types
evictions_df.dtypes

Court Index Number            object
Docket Number                  int64
Eviction Address              object
Eviction Apartment Number     object
Executed Date                 object
Marshal First Name            object
Marshal Last Name             object
Residential/Commercial        object
BOROUGH                       object
Eviction Postcode              int64
Ejectment                     object
Eviction/Legal Possession     object
Latitude                     float64
Longitude                    float64
Community Board              float64
Council District             float64
Census Tract                 float64
BIN                          float64
BBL                          float64
NTA                           object
dtype: object
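
The `Executed Date` column is stored as `object` (plain strings), and identifier columns such as `BIN` are stored as `float64` because they contain missing values. A minimal sketch of how these could be converted, assuming the dates follow the MM/DD/YYYY pattern seen in the preview rows. The three-row `sample` frame is a hypothetical stand-in for `evictions_df`, and the missing `BIN` in its last row is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for evictions_df (the missing BIN is invented for the demo)
sample = pd.DataFrame({
    'Executed Date': ['03/25/2019', '04/25/2017', '02/21/2017'],
    'BIN': [1034178.0, 2013777.0, None],
})

# Parse the MM/DD/YYYY strings into datetime64 values
sample['Executed Date'] = pd.to_datetime(sample['Executed Date'], format='%m/%d/%Y')

# The nullable Int64 dtype keeps identifiers as whole numbers and still allows missing values
sample['BIN'] = sample['BIN'].astype('Int64')

print(sample.dtypes)
```

With real dates in place, operations such as filtering by year or sorting chronologically work directly on the column.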

In [None]:
# .isnull() detects missing values
# To get the total summation of all missing values in the dataframe,
# chain two .sum() methods together
evictions_df.isnull().sum().sum()

62997

The DataFrame contains 62997 missing values in total. Next, we check which columns they fall in.

In [None]:
# .isna() is an alias of .isnull(); a single .sum() gives the count per column
evictions_df.isna().sum()

Court Index Number               0
Docket Number                    0
Eviction Address                 0
Eviction Apartment Number    11209
Executed Date                    0
Marshal First Name               0
Marshal Last Name                0
Residential/Commercial           0
BOROUGH                          0
Eviction Postcode                0
Ejectment                        0
Eviction/Legal Possession        0
Latitude                      6444
Longitude                     6444
Community Board               6444
Council District              6444
Census Tract                  6444
BIN                           6562
BBL                           6562
NTA                           6444
dtype: int64

Missing values by column:

* 11209 missing values in the 'Eviction Apartment Number' column
* 6444 missing values in the 'Latitude' column
* 6444 missing values in the 'Longitude' column
* 6444 missing values in the 'Community Board' column
* 6444 missing values in the 'Council District' column
* 6444 missing values in the 'Census Tract' column
* 6562 missing values in the 'BIN' column
* 6562 missing values in the 'BBL' column
* 6444 missing values in the 'NTA' column
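
Six columns share the same count of 6444, which suggests (but does not prove) that the same rows are missing all of the location fields. A small sketch of how that could be checked, along with the share of missing values per column. The four-row `sample` frame is a hypothetical stand-in for `evictions_df`, so the printed numbers describe the toy data only.

```python
import pandas as pd

# Hypothetical stand-in for evictions_df with a few location columns
sample = pd.DataFrame({
    'Latitude': [40.79, None, 40.86, None],
    'Longitude': [-73.97, None, -73.89, None],
    'NTA': ['Upper West Side', None, 'Bedford Park-Fordham North', None],
    'BIN': [1034178.0, None, None, None],
})

# Percentage of missing values per column
missing_pct = sample.isna().mean().mul(100).round(1)
print(missing_pct)

# Are Latitude and NTA missing on exactly the same rows?
same_rows = sample['Latitude'].isna().equals(sample['NTA'].isna())
print(same_rows)
```

If the masks match on the real data, a single successful geocoding pass per row is the only fix needed for latitude and longitude, while the derived fields would need a separate lookup.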



4. Investigate the missing values and clean the data

In [None]:
# Missing values in the Eviction Apartment Number column
# Not all Eviction Addresses are apartments
# Hence we will ignore this

In [None]:
# Missing values in the Latitude and Longitude columns
# We have the eviction address, borough, and postcode
# We can perform geocoding in Python using the geopy library
# Install the library using pip
# Documentation: https://geopy.readthedocs.io/en/latest/#geocoders

!pip install geopy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Creating a new column in the DataFrame to hold the eviction full address
evictions_df.loc[:, 'full_address'] = evictions_df['Eviction Address'] + "," + evictions_df['BOROUGH'] + ",NY," + evictions_df['Eviction Postcode'].astype(str)
evictions_df.head()

Unnamed: 0,Court Index Number,Docket Number,Eviction Address,Eviction Apartment Number,Executed Date,Marshal First Name,Marshal Last Name,Residential/Commercial,BOROUGH,Eviction Postcode,...,Eviction/Legal Possession,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,full_address
0,50365/19,352435,319 WEST 94TH STREET,C103,03/25/2019,Thomas,Bia,Residential,MANHATTAN,10025,...,Possession,40.794205,-73.974734,7.0,6.0,183.0,1034178.0,1012530000.0,Upper West Side,"319 WEST 94TH STREET,MANHATTAN,NY,10025"
1,B068159/16,379048,2332 CRESTON AVE,41,04/25/2017,Richard,McCoy,Residential,BRONX,10468,...,Possession,40.858643,-73.900402,5.0,14.0,23703.0,2013777.0,2031640000.0,Fordham South,"2332 CRESTON AVE,BRONX,NY,10468"
2,59891/16,320691,2670 BAINBRIDGE AVENUE,2F,02/21/2017,John,Villanueva,Residential,BRONX,10458,...,Possession,40.86547,-73.891472,7.0,15.0,40502.0,2016620.0,2032870000.0,Bedford Park-Fordham North,"2670 BAINBRIDGE AVENUE,BRONX,NY,10458"
3,75708/18,115775,18-24 25TH ROAD,1,04/23/2019,Maxine,Chevlowe,Residential,QUEENS,11102,...,Possession,40.774861,-73.92614,1.0,22.0,91.0,4019956.0,4008870000.0,Old Astoria,"18-24 25TH ROAD,QUEENS,NY,11102"
4,900940/18,86395,3005 EASTCHESTER RO AD,STOREFRONT,11/08/2018,Justin,Grossman,Commercial,BRONX,10469,...,Possession,40.869607,-73.842766,12.0,12.0,358.0,2061802.0,2047620000.0,Eastchester-Edenwald-Baychester,"3005 EASTCHESTER RO AD,BRONX,NY,10469"
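
Some addresses carry repeated spaces or trailing unit descriptions, which end up in the geocoder query unchanged. A minimal sketch of collapsing the extra whitespace before building `full_address`. The first sample row is modelled on a query that appears in the error output further down; the `sample` frame itself is a hypothetical stand-in for `evictions_df`, and whether this cleanup improves the match rate is untested here.

```python
import pandas as pd

# Hypothetical stand-in rows for evictions_df
sample = pd.DataFrame({
    'Eviction Address': ['34-16 88 STREET   2ND FLOOR FRONT ROOM LEFT', '2332 CRESTON AVE'],
    'BOROUGH': ['QUEENS', 'BRONX'],
    'Eviction Postcode': [11372, 10468],
})

# Collapse runs of whitespace into one space and trim both ends
clean_address = (
    sample['Eviction Address']
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)

# Build the full address in the same comma-separated form used above
sample['full_address'] = (
    clean_address + "," + sample['BOROUGH'] + ",NY,"
    + sample['Eviction Postcode'].astype(str)
)
print(sample['full_address'].tolist())
```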


In [None]:
# Use the Nominatim geocoding service
# Why this one among the many geocoding services?
# It does not require an API key

# Note - Nominatim usage limit: a maximum of 1 request per second
# Geocoding all 67728 rows would take at least 67728 secs = 18.81 hrs
# Geocoding only the 6444 rows missing lat and long
# would take at least 6444 secs = 1.79 hrs

# Approach - loop through the DataFrame and geocode only the rows with
# missing values, extracting the lat and long;
# this still takes approximately 2 hours

# The line below would instead geocode every row
# (it needs the geolocator object created in a later cell)
# evictions_df['geocodes'] = evictions_df.full_address.apply(geolocator.geocode)
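
The runtime estimates above follow from one request per second. The sketch below reproduces that arithmetic and shows one way the request count might be reduced further: geocoding each distinct address only once. The helper name `estimate_runtime_hours` and the toy `sample` frame are assumptions made for illustration, and how many duplicate addresses the real data contains is not checked here.

```python
import pandas as pd

def estimate_runtime_hours(n_requests, seconds_per_request=1.0):
    """Lower-bound runtime in hours for n_requests at a fixed request rate."""
    return n_requests * seconds_per_request / 3600

print(round(estimate_runtime_hours(67728), 2))  # every row
print(round(estimate_runtime_hours(6444), 2))   # only rows missing coordinates

# Hypothetical stand-in: two of the three rows that need geocoding share an address
sample = pd.DataFrame({
    'full_address': ['A ST,BRONX,NY,10468', 'A ST,BRONX,NY,10468',
                     'B AVE,QUEENS,NY,11102', 'C RD,BROOKLYN,NY,11225'],
    'Latitude': [None, None, None, 40.66],
})

# Keep only addresses that still need coordinates, one entry per distinct address
to_geocode = sample.loc[sample['Latitude'].isna(), 'full_address'].drop_duplicates()
print(len(to_geocode))
```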

Problem: geopy's Nominatim geocoder fails with a "too many requests" 403 error.

Stack Overflow solution: https://stackoverflow.com/questions/60083187/python-geopy-nominatim-too-many-requests

In [None]:
import logging
from time import sleep
from random import randint
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError

# Generate a random user agent to be used with the geolocator
user_agent = 'user_me_{}'.format(randint(10000,99999))
# Create a geolocator object using the Nominatim geocoder with the user agent
geolocator = Nominatim(user_agent=user_agent)

def reverse_geocode(geolocator, latlon, sleep_sec, max_retries=3):
    # Note: despite its name, this function performs forward geocoding
    # (address -> coordinates); `latlon` is the address string to look up
    try:
        # Try to get the location information from the geolocator
        return geolocator.geocode(latlon)
    except GeocoderTimedOut:
        # Give up after max_retries timeouts so the recursion cannot run forever
        if max_retries <= 0:
            logging.info('TIMED OUT: GeocoderTimedOut: Giving up.')
            return None
        # Otherwise log the timeout, wait for a random time, and retry
        logging.info('TIMED OUT: GeocoderTimedOut: Retrying...')
        sleep(randint(1*100,sleep_sec*100)/100)
        return reverse_geocode(geolocator, latlon, sleep_sec, max_retries - 1)
    except GeocoderServiceError as e:
        # If there is a service error, log the message and return None
        logging.info('CONNECTION REFUSED: GeocoderServiceError encountered.')
        logging.error(e)
        return None
    except Exception as e:
        # If there is any other exception, log the message and return None
        logging.info('ERROR: Terminating due to exception {}'.format(e))
        return None


In [None]:
# Iterate through each row of the evictions_df DataFrame
for i, row in evictions_df.iterrows():
    # Only geocode rows whose Latitude or Longitude is NaN
    if pd.isna(row['Latitude']) or pd.isna(row['Longitude']):
        # Look up the full address (reverse_geocode wraps geolocator.geocode)
        location = reverse_geocode(geolocator, row['full_address'], 1)
        # If location information is obtained successfully
        if location is not None:
            # Set the Latitude value of the corresponding row of the DataFrame
            evictions_df.at[i, 'Latitude'] = location.latitude
            # Set the Longitude value of the corresponding row of the DataFrame
            evictions_df.at[i, 'Longitude'] = location.longitude
        # Pause to respect Nominatim's limit of 1 request per second
        sleep(1)

ERROR:root:HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=34-16+88+STREET+++2ND+FLOOR+FRONT+ROOM+LEFT%2CQUEENS%2CNY%2C11372&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))
ERROR:root:HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=1165+BROADWAY+STORE+%234+AND+ROOM+204%2CMANHATTAN%2CNY%2C10001&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))
ERROR:root:HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=119-05+METROPOLITAN++AVENUE++ON+THE+2ND%2CQUEENS%2CNY%2C11415&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)
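
The errors above are read timeouts at `read timeout=1`. One possible restructuring, sketched below, separates the fill logic from the geocoder so that it can be tested offline and so that a throttled or longer-timeout geocoder can be swapped in. The function name `fill_missing_coordinates`, the `Location` tuple, and `fake_geocode` are hypothetical names introduced for this sketch; the stub stands in for a real geocoder and makes no network calls.

```python
import pandas as pd
from collections import namedtuple

def fill_missing_coordinates(df, geocode_fn, address_col='full_address'):
    """Fill NaN Latitude/Longitude in place, geocoding each distinct address once.

    geocode_fn is any callable that takes an address string and returns an object
    with .latitude and .longitude attributes, or None when nothing is found.
    Returns the number of rows filled.
    """
    missing = df['Latitude'].isna() | df['Longitude'].isna()
    cache = {}  # address -> lookup result, so repeated addresses cost one request
    filled = 0
    for idx, address in df.loc[missing, address_col].items():
        if address not in cache:
            cache[address] = geocode_fn(address)
        location = cache[address]
        if location is not None:
            df.at[idx, 'Latitude'] = location.latitude
            df.at[idx, 'Longitude'] = location.longitude
            filled += 1
    return filled

# Hypothetical offline stand-in for a geocoder result and lookup function
Location = namedtuple('Location', ['latitude', 'longitude'])

def fake_geocode(address):
    known = {'18-24 25TH ROAD,QUEENS,NY,11102': Location(40.774861, -73.926140)}
    return known.get(address)

sample = pd.DataFrame({
    'full_address': ['18-24 25TH ROAD,QUEENS,NY,11102',
                     'NOWHERE,BRONX,NY,10469',
                     '18-24 25TH ROAD,QUEENS,NY,11102'],
    'Latitude': [float('nan')] * 3,
    'Longitude': [float('nan')] * 3,
})

n_filled = fill_missing_coordinates(sample, fake_geocode)
print(n_filled, int(sample['Latitude'].isna().sum()))
```

With geopy, `geocode_fn` could be built as `RateLimiter(Nominatim(user_agent=..., timeout=10).geocode, min_delay_seconds=1)`, using `RateLimiter` from `geopy.extra.rate_limiter`. The timeout of 10 seconds is an assumed value, not one tested against this dataset.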

In [None]:
# View the changes in the number of missing values in Latitude and Longitude columns

# .isna() detects missing values in columns
evictions_df.isna().sum()

Court Index Number               0
Docket Number                    0
Eviction Address                 0
Eviction Apartment Number    11209
Executed Date                    0
Marshal First Name               0
Marshal Last Name                0
Residential/Commercial           0
BOROUGH                          0
Eviction Postcode                0
Ejectment                        0
Eviction/Legal Possession        0
Latitude                      6000
Longitude                     6000
Community Board               6444
Council District              6444
Census Tract                  6444
BIN                           6562
BBL                           6562
NTA                           6444
full_address                     0
dtype: int64
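
Comparing this output with the earlier one, the missing count for Latitude and Longitude dropped from 6444 to 6000. The short sketch below makes that difference explicit, using the counts copied from the two outputs in this notebook.

```python
import pandas as pd

# Missing-value counts copied from the two .isna().sum() outputs above
before = pd.Series({'Latitude': 6444, 'Longitude': 6444})
after = pd.Series({'Latitude': 6000, 'Longitude': 6000})

# Rows filled by the geocoding pass, and their share of the originally missing rows
filled = before - after
pct_filled = (filled / before * 100).round(1)
print(filled['Latitude'], pct_filled['Latitude'])
```

So the geocoding pass filled 444 rows, roughly 6.9% of those that were missing coordinates.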

In [None]:
evictions_df.to_csv("evictions_geocode_version.csv", index=False)