In [1]:
import pandas as pd

Read the raw data from a CSV file that was downloaded from https://data.cityofnewyork.us/Education/Brooklyn-Schools/bkjd-kr4k.
See also download.ipynb, which downloads the file if it is not present.

In [2]:
df = pd.read_csv('../data/1_original/brooklyn_schools.csv', sep=',')

Drop superfluous columns

In [3]:
df.drop(columns=[
    'DBN', 'Location Code', 'Building Name', 'Borough',
    'Geographical District Code', 'Schools in Building', 'ENGroupA',
    'Register', '# Schools',
    'Major N', 'Oth N', 'NoCrim N', 'Prop N', 'Vio N',
    'AvgOfMajor N', 'AvgOfOth N', 'AvgOfNoCrim N', 'AvgOfProp N', 'AvgOfVio N',
], inplace=True)
df.head()

Unnamed: 0,Building Code,Location Name,Address,RangeA
0,K001,P.S. 001 The Bergen,309 47 STREET,1251-1500
1,K002,Parkside Preparatory Academy,655 PARKSIDE AVENUE,251-500
2,K002,P.S. K141,655 PARKSIDE AVENUE,251-500
3,K002,Explore Charter School,655 PARKSIDE AVENUE,
4,K002,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,655 PARKSIDE AVENUE,751-1000


Delete all rows that have a NaN value in the 'RangeA' column. (The 'Major N' column was already dropped in the previous step.)

In [4]:
df.dropna(subset=['RangeA'], inplace=True)
df.head()

Unnamed: 0,Building Code,Location Name,Address,RangeA
0,K001,P.S. 001 The Bergen,309 47 STREET,1251-1500
1,K002,Parkside Preparatory Academy,655 PARKSIDE AVENUE,251-500
2,K002,P.S. K141,655 PARKSIDE AVENUE,251-500
4,K002,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,655 PARKSIDE AVENUE,751-1000
5,K003,P.S. 003 The Bedford Village,50 JEFFERSON AVENUE,501-750


Parse the 'RangeA' (population of building) column.
Each value is a range such as 751-1000. Split the range at '-' and use the second (larger) value. Some values have the open-ended form 4000+; for these, split at '+' and use the first part.
Write the resulting integers back to the 'RangeA' column.

In [5]:
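def map_range_a(value):
    """Return the upper bound of a range such as '751-1000', or the number in '4000+'."""
    try:
        return int(value.split('-')[1].strip())
    except IndexError:
        # No '-' in the value, so it has the open-ended form '4000+'
        return int(value.split('+')[0])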

df['RangeA'] = df['RangeA'].apply(map_range_a)
df.head()

Unnamed: 0,Building Code,Location Name,Address,RangeA
0,K001,P.S. 001 The Bergen,309 47 STREET,1500
1,K002,Parkside Preparatory Academy,655 PARKSIDE AVENUE,500
2,K002,P.S. K141,655 PARKSIDE AVENUE,500
4,K002,655 PARKSIDE AVENUE CONSOLIDATED LOCATION,655 PARKSIDE AVENUE,1000
5,K003,P.S. 003 The Bedford Village,50 JEFFERSON AVENUE,750
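
As an alternative to the row-wise function, the same parsing rule can be sketched with pandas string methods. This is only an illustration on three made-up sample values in the two formats described above, not the notebook's actual code path; the names `ranges` and `upper` are placeholders.

```python
import pandas as pd

# Sample values in the two formats described above (illustrative only)
ranges = pd.Series(['1251-1500', '251-500', '4000+'])

# Collect every run of digits per value. For 'a-b' the last match is the
# upper bound b; for 'n+' the only match is n.
upper = ranges.str.findall(r'\d+').str[-1].astype(int)
```

Taking the last run of digits covers both formats with one expression, so no exception handling is needed.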


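Many schools are located in the same building at the same address. We group these rows by building code and address and take the maximum of RangeA in each group. Grouping by the address as well keeps it in the result. Note that max() is also applied to 'Location Name', so the name shown for a building is the lexicographically largest one, not necessarily that of the school with the largest RangeA.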

In [6]:
df = df.groupby(['Building Code', 'Address'], as_index=False).max()
df.head()

Unnamed: 0,Building Code,Address,Location Name,RangeA
0,K001,309 47 STREET,P.S. 001 The Bergen,1500
1,K002,655 PARKSIDE AVENUE,Parkside Preparatory Academy,1000
2,K003,50 JEFFERSON AVENUE,P.S. 003 The Bedford Village,750
3,K005,820 HANCOCK STREET,P.S. 005 Dr. Ronald Mcnair,500
4,K006,43 SNYDER AVENUE,P.S. 006,750
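
If the individual school names should survive the grouping, one option is to aggregate each column explicitly. The following is a minimal sketch on toy rows modeled on building K002; the `toy` frame and the `schools` column name are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy rows modeled on the K002 building (illustrative, not the real data set)
toy = pd.DataFrame({
    'Building Code': ['K002', 'K002', 'K002'],
    'Address': ['655 PARKSIDE AVENUE'] * 3,
    'Location Name': ['Parkside Preparatory Academy', 'P.S. K141',
                      '655 PARKSIDE AVENUE CONSOLIDATED LOCATION'],
    'RangeA': [500, 500, 1000],
})

# Named aggregation: maximum population per building, plus all school
# names of that building joined into one string
grouped = toy.groupby(['Building Code', 'Address'], as_index=False).agg(
    RangeA=('RangeA', 'max'),
    schools=('Location Name', lambda names: '; '.join(names)),
)
```

This keeps one row per building while making it explicit how each column is combined.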


Get the longitude and latitude of each address by querying an OpenStreetMap service. To speed up subsequent runs, we store the results in a JSON file and use that file as a cache.

In [7]:
import importlib
import address_tools

# hack to reload the changed module in Jupyter
importlib.reload(address_tools)

def map_row(row):
    coordinates = address_tools.get_coordinates_from_address(row['Address'], resolve=False)
    row['lon'] = coordinates.lon
    row['lat'] = coordinates.lat
    row['valid_coordinates'] = coordinates.is_valid()
    row['in_brooklyn'] = address_tools.brooklyn.contains(coordinates)

    return row
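
The internals of `address_tools` are not shown here, so the following is only a rough sketch of how a JSON file cache for address lookups might work. The function `cached_lookup` and its `geocode` callback are hypothetical names and are not the module's actual API.

```python
import json
import os

def cached_lookup(address, cache_path, geocode):
    """Return coordinates for an address, consulting a JSON file cache first.

    `geocode` is a hypothetical callable that maps an address string to a
    JSON-serializable value such as [lon, lat].
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if address not in cache:
        # Cache miss: call the geocoder once and persist the result
        cache[address] = geocode(address)
        with open(cache_path, 'w') as f:
            json.dump(cache, f)
    return cache[address]
```

With such a cache, only addresses that have not been seen before trigger a request. Rereading the file on every call keeps the sketch short; a real implementation would more likely load the cache once.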

In [8]:
df = df.apply(map_row, axis=1)
df = df[df['in_brooklyn'] == True]
df.to_csv('../data/3_prepared/schools.csv', sep=';')
df.head()


Unnamed: 0,Building Code,Address,Location Name,RangeA,lon,lat,valid_coordinates,in_brooklyn
0,K001,309 47 STREET,P.S. 001 The Bergen,1500,-73.999662,40.637904,True,True
1,K002,655 PARKSIDE AVENUE,Parkside Preparatory Academy,1000,-73.951524,40.656441,True,True
2,K003,50 JEFFERSON AVENUE,P.S. 003 The Bedford Village,750,-73.955544,40.682302,True,True
3,K005,820 HANCOCK STREET,P.S. 005 Dr. Ronald Mcnair,500,-73.922603,40.685546,True,True
4,K006,43 SNYDER AVENUE,P.S. 006,750,-73.956111,40.648889,True,True
