# A Price Prediction Web App - ETL Notebook 1

The central steps in the ETL process for this project are:  
1. Convert state plane coordinates from the Dep. of City Planning into standard latitude/longitude coordinates (this notebook)
2. Aggregate ~60 Dep. of Finance sales datasets, spread across five boroughs and twelve years (Notebook 2)
3. Merge the Dep. of Finance and Dep. of City Planning datasets (Notebook 2)

### Locations of the three sources: 
1. NYC Dep. of City Planning (has state plane coordinates for all NYC properties):  
   http://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page
2. NYC Dep. of Finance (tracks sales of all NYC properties):  
   https://www1.nyc.gov/site/finance/taxes/property-annualized-sales-update.page
3. NYTimes Real Estate Section (current listings on the market):      
   https://www.nytimes.com/section/realestate  
   The web scraper I built to pull listings from the NYTimes website:  
   https://github.com/MDHRDY/A_NY_Times_Real_Estate_Web_Scraper

## Notebook 1: Generate GPS Coordinates From Dep. of City Planning Dataset

### Outline
1 [Load Libraries and Dep. of City Planning datasets](#load)  
2 [Convert State Plane Coordinates to Lat Long Coordinates](#gps)  
3 [Save dataframe w/ converted coordinates to file](#to_file)

<a id='load'></a>
### Load Libraries and Dep. of City Planning datasets

In [10]:
import pandas
import numpy as np
import warnings
warnings.filterwarnings('ignore')
pandas.options.display.max_columns = 100

Read the downloaded Dep. of City Planning (PLUTO) CSV for each borough into a dataframe

In [7]:
path = "/masked/nyc_project/csvs/nyc_pluto_16v1/"
bk = pandas.read_csv(path + 'BK.csv', low_memory=False)
bx = pandas.read_csv(path + 'BX.csv', low_memory=False)
mn = pandas.read_csv(path + 'MN.csv', low_memory=False)
qn = pandas.read_csv(path + 'QN.csv', low_memory=False)
si = pandas.read_csv(path + 'SI.csv', low_memory=False)

Label each dataframe with its borough code (1 = Manhattan, 2 = Bronx, 3 = Brooklyn, 4 = Queens, 5 = Staten Island) before concatenating them

In [8]:
mn['B'] = 1
bx['B'] = 2
bk['B'] = 3
qn['B'] = 4
si['B'] = 5

boroughs = [mn, bx, bk, qn, si]
Lat_Long_File = pandas.concat(boroughs)
print 'Shape of master dataframe: ',Lat_Long_File.shape

Shape of master dataframe:  (859205, 87)
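
The labeling and concatenation step can also be written as a loop over a mapping from borough code to dataframe. A minimal sketch, using tiny made-up frames in place of the real PLUTO CSVs (the `Lot` column and its values are placeholders, not real data):

```python
import pandas as pd

# Tiny stand-in frames; the notebook itself reads one PLUTO CSV per borough.
frames = {
    1: pd.DataFrame({'Lot': [10, 11]}),   # Manhattan
    2: pd.DataFrame({'Lot': [20]}),       # Bronx
    3: pd.DataFrame({'Lot': [30, 31]}),   # Brooklyn
}

# assign() returns a copy with the borough-code column 'B' added,
# so the original frames are left untouched.
labeled = [frame.assign(B=code) for code, frame in frames.items()]

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(labeled, ignore_index=True)
```

Keeping the codes in one mapping makes it harder to attach the wrong code to a borough than five separate assignments would.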


<a id='gps'></a>
### Convert State Plane Coordinates to Lat Long Coordinates

In [14]:
print 'Detect missing values: ', Lat_Long_File.XCoord.isnull().sum()

Detect missing values:  25023


`pyproj` -> performs cartographic transformations 

Convert between the following two coordinate systems:  
`wgs84` -> the World Geodetic System, 1984 revision (standard latitude/longitude)  
`spft` -> the state plane coordinate system, in feet (used by the Dep. of City Planning)

In [23]:
import pyproj as pp
import math
wgs84  = pp.Proj("+init=EPSG:4326")
spft = pp.Proj("+init=ESRI:102318 +units=us-ft", preserve_units=True)

`np.isfinite` tests element-wise for finiteness (neither NaN nor infinite). Use it to keep only the rows that have a valid `XCoord`.

In [21]:
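# .copy() makes df an independent dataframe, so adding columns later does not touch Lat_Long_File
df = Lat_Long_File[np.isfinite(Lat_Long_File['XCoord'])].copy()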
print "check that isfinite values + null values comprises all values: "
print "master dataframe: ", Lat_Long_File.shape[0]
print "null values: ",Lat_Long_File.XCoord.isnull().sum() 
print "isfinite values: ",df.shape[0]

check that isfinite values + null values comprises all values: 
master dataframe:  859205
null values:  25023
isfinite values:  834182
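
The two counts above add up to the total only because the column has no infinite values. A small sketch on a made-up array shows the distinction: `np.isfinite` is False both for NaN (how pandas stores missing numbers) and for infinities, while `np.isnan` flags NaN only.

```python
import numpy as np

# Made-up sample: two valid coordinates, one missing value, one infinity.
values = np.array([987654.0, np.nan, 1012345.0, np.inf])

finite_mask = np.isfinite(values)   # True only for ordinary numbers
null_mask = np.isnan(values)        # True only for NaN

n_finite = int(finite_mask.sum())
n_null = int(null_mask.sum())
# Whatever is neither finite nor NaN must be +inf or -inf.
n_infinite = len(values) - n_finite - n_null
```

If `n_infinite` were nonzero on the real data, the finite and null counts would no longer sum to the row count.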


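Transform the state plane coordinates to longitude/latitude. `pp.transform` returns the x values (longitude) first and the y values (latitude) second.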

In [24]:
print 'pre-transformation dataframe shape: ', df.shape
x = df['XCoord'].tolist()
x = map(int, x)
y = df['YCoord'].tolist()
y = map(int, y)


transformed_x, transformed_y = pp.transform(spft, wgs84, x,y)

df['Long'] = transformed_x
df['Lat'] = transformed_y
print "post-transformation dataframe shape: ", df.shape

pre-transformation dataframe shape:  (834182, 87)
post-transformation dataframe shape:  (834182, 89)
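
Newer pyproj releases favor the `Transformer` class over the module-level `transform` function. A hedged sketch of the same kind of conversion, assuming a recent pyproj and assuming that EPSG:2263 (NAD83 / New York Long Island, US survey feet) is an acceptable stand-in for the `spft` definition used in this notebook. The sample point is hypothetical: it is the false origin of that projection, not a real property.

```python
from pyproj import Transformer

# Assumption: EPSG:2263 stands in for the notebook's `spft` projection.
# always_xy=True fixes the axis order to (x, y) in and (lon, lat) out.
to_wgs84 = Transformer.from_crs("EPSG:2263", "EPSG:4326", always_xy=True)

# Hypothetical sample point, in feet: the false origin of the projection.
xs = [984250.0]
ys = [0.0]

# Sequences go in, sequences of the same length come out.
lons, lats = to_wgs84.transform(xs, ys)
```

Building the transformer once and reusing it for the whole column avoids repeating the setup work for every call.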


In [13]:
print "Number of unique longitude entries: ", len(df['Long'].unique())

Number of unique longitude entries:  834154
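
Counting unique longitudes alone does not show whether whole locations repeat, or whether the values are plausible. A sketch of two extra checks on a made-up frame: duplicate (Long, Lat) pairs, and a rough bounding box around New York City. The box limits are approximate assumptions, not official boundaries.

```python
import pandas as pd

# Made-up coordinates; the last row is deliberately far outside the city.
sample = pd.DataFrame({
    'Long': [-73.98, -73.98, -73.95, -80.00],
    'Lat':  [40.75, 40.75, 40.68, 40.70],
})

# Rows that repeat an earlier (Long, Lat) pair.
n_duplicate_pairs = int(sample.duplicated(subset=['Long', 'Lat']).sum())

# Approximate NYC bounding box (assumed limits, inclusive on both ends).
in_box = (sample['Long'].between(-74.3, -73.6)
          & sample['Lat'].between(40.4, 41.0))
n_outside = int((~in_box).sum())
```

A nonzero `n_outside` on the real data would point to a wrong projection definition or swapped x/y inputs.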


<a id='to_file'></a>
### Write results to file: 

In [15]:
path2 = "/masked/nyc_project/"
df.to_csv(path2 + 'All_tax_classes_Lat_Long_w_89_variables.csv', index=False)