## Wrangling public school location data 

### Goals of the Task



The dataset retrieved from the Seattle open data portal contains two files: a csv file and a geojson file. <br>
*Each row in both files is a school.* <br>
The csv file is designed for a geospatial tool that plots X, Y values on a map image, which we don't have access to. The json file is a dictionary that does contain location coordinates along with other information about each school, but the information is nested (dictionaries within dictionaries). This file needs to be unnested to make a readable dataframe of longitude and latitude data, which can then be joined to the schools data. 

- We can potentially use this data to identify how many public schools lie within walking distance of each cycle hire station, and how far the nearest public school is from a given cycle hire station. This information could be useful for estimating school-related demand on the cycle hire network in term time versus school holidays. 
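
As a rough illustration of the distance idea, here is a minimal sketch that uses the haversine formula on longitude and latitude pairs. The station location, the 1 km walking threshold and the helper name `haversine_km` are assumptions made for the example; only the two school coordinates come from the geojson file.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical cycle hire station as (longitude, latitude); not from the dataset
station = (-122.30, 47.70)

# First two schools from the geojson file, as (schID, longitude, latitude)
schools = [
    (106, -122.293009024934, 47.709944861847),
    (292, -122.314658339085, 47.7134302314171),
]

WALKING_DISTANCE_KM = 1.0  # assumed threshold for "walking distance"

# Distance from the station to every school, keyed by school ID
distances = {sid: haversine_km(station[0], station[1], lon, lat) for sid, lon, lat in schools}
nearest_id = min(distances, key=distances.get)
within_walk = [sid for sid, d in distances.items() if d <= WALKING_DISTANCE_KM]

print(nearest_id, round(distances[nearest_id], 2), within_walk)
```

Straight-line distance is only a proxy for walking distance, so the threshold would need tuning against real routes.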

#### Step 1: use pandas to read the schools csv file as a data frame 
- import pandas as pd 
- use pandas read_csv to create a schools data frame
- ensure you are pointing at the correct file path for the data source (you may have to navigate in your notebook!) 


In [17]:
import pandas as pd

In [31]:
sites = pd.read_csv("Seattle_Public_Schools_Sites_2022-2023.csv")

In [32]:
sites.head()

Unnamed: 0,X,Y,OBJECTID,schID,schName,mapLabel,Status,esmshs
0,1281074.0,262373.910089,1,106,Jane Addams,Jane Addams,MS,MS
1,1275769.0,263746.860099,2,292,Hazel Wolf K-8,Hazel Wolf K-8,Option ELEM,ES
2,1286988.0,185254.069906,3,264,Rainier View,Rainier View,ELEM,ES
3,1258790.0,189752.729872,4,203,Arbor Heights,Arbor Heights,ELEM,ES
4,1288217.0,191052.880159,5,221,Emerson,Emerson,ELEM,ES


In [33]:
sites.drop(['X','Y','mapLabel','Status'], axis = 1, inplace=True)

In [34]:
sites.head()

Unnamed: 0,OBJECTID,schID,schName,esmshs
0,1,106,Jane Addams,MS
1,2,292,Hazel Wolf K-8,ES
2,3,264,Rainier View,ES
3,4,203,Arbor Heights,ES
4,5,221,Emerson,ES


In [35]:
sites.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   OBJECTID  117 non-null    int64 
 1   schID     117 non-null    int64 
 2   schName   117 non-null    object
 3   esmshs    100 non-null    object
dtypes: int64(2), object(2)
memory usage: 3.8+ KB


#### Step 2: drop unnecessary columns 

Remove the X and Y coordinate, map label and status columns from the dataframe using a slice or selection method. 

Then, use head() and info() to preview the remaining schools dataframe 
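
The two approaches can be sketched side by side on a small sample. The three rows are copied from the preview of the schools csv file; the header line and the names `sample_csv`, `sites_sample`, `dropped` and `selected` are assumptions made so that the example runs without the real file.

```python
import io

import pandas as pd

# Three rows copied from the csv preview; the header line is assumed
sample_csv = io.StringIO(
    "X,Y,OBJECTID,schID,schName,mapLabel,Status,esmshs\n"
    "1281074.0,262373.910089,1,106,Jane Addams,Jane Addams,MS,MS\n"
    "1275769.0,263746.860099,2,292,Hazel Wolf K-8,Hazel Wolf K-8,Option ELEM,ES\n"
    "1286988.0,185254.069906,3,264,Rainier View,Rainier View,ELEM,ES\n"
)
sites_sample = pd.read_csv(sample_csv)

# Option 1: drop the unwanted columns and get a new dataframe back
dropped = sites_sample.drop(columns=["X", "Y", "mapLabel", "Status"])

# Option 2: select only the columns to keep
keep = ["OBJECTID", "schID", "schName", "esmshs"]
selected = sites_sample[keep]

print(selected.head())
selected.info()
```

Both options leave the original dataframe untouched, which makes it easier to re-run a cell without hitting a missing-column error.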

In [40]:
import json

In [41]:
geo = 'Seattle_Public_Schools_Sites_2022-2023.geojson'

#### Step 3: read in the geojson file using its file path
- import the json library 
- set the file path as a variable, for example: 
<blockquote>
    path = 'data/Seattle_Public_Schools_Sites_2022-2023.geojson'<br>
</blockquote>  

- open the file using json.load and the file path 

<blockquote>
    with open(path) as f: <br>
        -> schoolsdict = json.load(f) <br>
        -> print(schoolsdict)
</blockquote>

- review the schoolsdict variable by eye and look for the nested dictionary structure. You should see that the file contains (at the uppermost level) 4 keys: 'type', 'name', 'crs' and 'features'. 'crs' holds a nested dictionary and 'features' holds a list of dictionaries (one per school), but the printed output is hard to read!
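
Printing the whole dictionary is hard to scan, so one option is to look at the keys and at a single feature. The sketch below uses `schoolsdict_sample`, a hypothetical cut-down copy of the printed structure holding only the first school.

```python
import json

# Cut-down copy of the structure printed from the geojson file (first feature only)
schoolsdict_sample = {
    "type": "FeatureCollection",
    "name": "Seattle_Public_Schools_Sites_2022-2023",
    "crs": {"type": "name", "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}},
    "features": [
        {
            "type": "Feature",
            "properties": {
                "OBJECTID": 1, "schID": 106, "schName": "Jane Addams",
                "mapLabel": "Jane Addams", "Status": "MS", "esmshs": "MS",
            },
            "geometry": {
                "type": "Point",
                "coordinates": [-122.293009024934, 47.709944861847],
            },
        }
    ],
}

# Top-level keys of the dictionary
top_level_keys = list(schoolsdict_sample.keys())
print(top_level_keys)

# Number of features (schools) in the sample
print(len(schoolsdict_sample["features"]))

# Pretty-print just the first feature instead of the whole file
first_feature = schoolsdict_sample["features"][0]
print(json.dumps(first_feature, indent=2))
```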

In [38]:
# the geojson file sits in the same folder as the csv file read in Step 1
path = 'Seattle_Public_Schools_Sites_2022-2023.geojson'

In [42]:
with open(path) as f:
    geo = json.load(f)
    print(geo)

{'type': 'FeatureCollection', 'name': 'Seattle_Public_Schools_Sites_2022-2023', 'crs': {'type': 'name', 'properties': {'name': 'urn:ogc:def:crs:OGC:1.3:CRS84'}}, 'features': [{'type': 'Feature', 'properties': {'OBJECTID': 1, 'schID': 106, 'schName': 'Jane Addams', 'mapLabel': 'Jane Addams', 'Status': 'MS', 'esmshs': 'MS'}, 'geometry': {'type': 'Point', 'coordinates': [-122.293009024934, 47.709944861847]}}, {'type': 'Feature', 'properties': {'OBJECTID': 2, 'schID': 292, 'schName': 'Hazel Wolf K-8', 'mapLabel': 'Hazel Wolf K-8', 'Status': 'Option ELEM', 'esmshs': 'ES'}, 'geometry': {'type': 'Point', 'coordinates': [-122.314658339085, 47.7134302314171]}}, {'type': 'Feature', 'properties': {'OBJECTID': 3, 'schID': 264, 'schName': 'Rainier View', 'mapLabel': 'Rainier View', 'Status': 'ELEM', 'esmshs': 'ES'}, 'geometry': {'type': 'Point', 'coordinates': [-122.263172064205, 47.498863321283]}}, {'type': 'Feature', 'properties': {'OBJECTID': 4, 'schID': 203, 'schName': 'Arbor Heights', 'mapLabe

In [47]:
for feature in geo['features']:
    print(feature['properties'])

{'OBJECTID': 1, 'schID': 106, 'schName': 'Jane Addams', 'mapLabel': 'Jane Addams', 'Status': 'MS', 'esmshs': 'MS'}
{'OBJECTID': 2, 'schID': 292, 'schName': 'Hazel Wolf K-8', 'mapLabel': 'Hazel Wolf K-8', 'Status': 'Option ELEM', 'esmshs': 'ES'}
{'OBJECTID': 3, 'schID': 264, 'schName': 'Rainier View', 'mapLabel': 'Rainier View', 'Status': 'ELEM', 'esmshs': 'ES'}
{'OBJECTID': 4, 'schID': 203, 'schName': 'Arbor Heights', 'mapLabel': 'Arbor Heights', 'Status': 'ELEM', 'esmshs': 'ES'}
{'OBJECTID': 5, 'schID': 221, 'schName': 'Emerson', 'mapLabel': 'Emerson', 'Status': 'ELEM', 'esmshs': 'ES'}
{'OBJECTID': 6, 'schID': 267, 'schName': 'Roxhill', 'mapLabel': 'Roxhill', 'Status': 'Option School with continuous enrollment', 'esmshs': None}
{'OBJECTID': 7, 'schID': 291, 'schName': 'South Shore K-8', 'mapLabel': 'South Shore K-8', 'Status': 'Option ELEM', 'esmshs': 'ES'}
{'OBJECTID': 8, 'schID': 224, 'schName': 'FAUNTLEROY BLDG', 'mapLabel': 'FAUNTLEROY BLDG', 'Status': 'Closed_Leased_Land', 'esmsh

#### Step 4: print the properties of each feature

- drilling into the features reveals a properties dictionary for each school, containing the school name and school ID, which could be used to join to the csv file 

<blockquote>
    for feature in schoolsdict['features']:<br>
      ->  print(feature['properties'])<br>
              </blockquote>
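
To check which property could serve as a join key, the properties dictionaries can be gathered into a dataframe. This is only a sketch: `features_sample` is a hypothetical list holding three features copied from the printed output, with the geometry left out.

```python
import pandas as pd

# Three features copied from the printed properties (geometry omitted)
features_sample = [
    {"properties": {"OBJECTID": 1, "schID": 106, "schName": "Jane Addams",
                    "mapLabel": "Jane Addams", "Status": "MS", "esmshs": "MS"}},
    {"properties": {"OBJECTID": 2, "schID": 292, "schName": "Hazel Wolf K-8",
                    "mapLabel": "Hazel Wolf K-8", "Status": "Option ELEM", "esmshs": "ES"}},
    {"properties": {"OBJECTID": 6, "schID": 267, "schName": "Roxhill",
                    "mapLabel": "Roxhill",
                    "Status": "Option School with continuous enrollment", "esmshs": None}},
]

# One row per school, one column per property
properties = pd.DataFrame([feature["properties"] for feature in features_sample])

print(properties[["schID", "schName"]])

# A join key should be unique and never missing
print(properties["schID"].is_unique, properties["schID"].isna().sum())
```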

In [48]:
coords = geo['features'][0]['geometry']['coordinates']
print(coords)

[-122.293009024934, 47.709944861847]


#### Step 5: print the coordinates of the first school in the file 

- using list indexing we can focus on position 0, the first school in the source data. This reveals the longitude and latitude of the school, in that order 

<blockquote>
    coords = schoolsdict['features'][0]['geometry']['coordinates'] <br>
print(coords)</blockquote>
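
GeoJSON stores point coordinates as [longitude, latitude], so it helps to unpack the pair into named variables before using it. A minimal sketch with the first school's coordinates hard-coded from the file:

```python
# Coordinates of the first school, copied from the geojson file
coords = [-122.293009024934, 47.709944861847]

# GeoJSON order is longitude first, then latitude
longitude, latitude = coords

print(f"longitude={longitude}, latitude={latitude}")
```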


In [50]:
for i in range(len(geo['features'])):
     print(geo['features'][i]['geometry']['coordinates'])

[-122.293009024934, 47.709944861847]
[-122.314658339085, 47.7134302314171]
[-122.263172064205, 47.498863321283]
[-122.377587813408, 47.5097003606568]
[-122.258636146464, 47.5148204654414]
[-122.370558043205, 47.5180529159752]
[-122.272009121329, 47.5237429478992]
[-122.387871932427, 47.5219061704517]
[-122.265858983284, 47.5240707646718]
[-122.324690584152, 47.5236109016963]
[-122.274655654791, 47.5249891889941]
[-122.270919626841, 47.5252645773187]
[-122.348277286993, 47.5251774942707]
[-122.28903550729, 47.5291516293077]
[-122.366154798809, 47.5286320504163]
[-122.284906008453, 47.5307393432751]
[-122.366041530751, 47.5316539580682]
[-122.374927418131, 47.5328376591848]
[-122.296268298114, 47.5383331982934]
[-122.296268298114, 47.5383331982934]
[-122.358658082289, 47.5396083901531]
[-122.388498218919, 47.5402508857746]
[-122.276911759732, 47.5417432062751]
[-122.373492687303, 47.5417618672396]
[-122.268534249037, 47.5457573099578]
[-122.281980381889, 47.5463218607493]
[-122.296286979

#### Step 6: extend this method using a for loop to print all the coordinates 

<blockquote>
for i in range(len(schoolsdict['features'])):<br>
      -> print(schoolsdict['features'][i]['geometry']['coordinates'])<br>
</blockquote>

In [84]:
latitude = []
for i in range(len(geo['features'])):
    # GeoJSON coordinates are [longitude, latitude], so latitude is at position 1
    latitude.append(geo['features'][i]['geometry']['coordinates'][1])

In [86]:
longitude = []
for i in range(len(geo['features'])):
    # longitude is at position 0
    longitude.append(geo['features'][i]['geometry']['coordinates'][0])

In [93]:
schID = []
for sid in range(len(geo['features'])):
    schID.append(geo['features'][sid]['properties']['schID'])

In [100]:
schools_location = pd.DataFrame(list(zip(schID,latitude,longitude)), columns=['schID', 'latitude', 'longitude'])

In [101]:
schools_location.head()

Unnamed: 0,schID,latitude,longitude
0,106,47.709945,-122.293009
1,292,47.713430,-122.314658
2,264,47.498863,-122.263172
3,203,47.509700,-122.377588
4,221,47.514820,-122.258636


In [82]:
schools_location = pd.DataFrame(columns=['schID', 'longitude','latitude'])

In [83]:
schools_location

Unnamed: 0,schID,longitude,latitude


In [59]:
schools_location=[]
for item in geo:
    schools_location.append(item)

In [69]:
schools_location=pd.DataFrame(schools_location)


In [70]:
schools_location.head()

Unnamed: 0,0
0,type
1,name
2,crs
3,features


#### Step 7: collect the school IDs and geolocation coordinates from the geojson file into a schoolslocation dataframe

- use a for loop to collect the data and convert it to a new schoolslocation dataframe 
- the final data frame should be 117 rows long and 3 columns wide
- column headers are schoolid, longitude and latitude 
- make sure you map the correct data to the correct column header
- preview your schoolslocation dataframe to ensure it shows the expected results
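
One way to sketch this step is to build a list of row dictionaries in a single loop, which keeps each school ID next to its own coordinates. `schoolsdict_sample` below is a hypothetical three-school extract of the geojson file; on the full file the same loop should give 117 rows.

```python
import pandas as pd

# Three features copied from the geojson file; the real file has 117
schoolsdict_sample = {
    "features": [
        {"properties": {"schID": 106},
         "geometry": {"coordinates": [-122.293009024934, 47.709944861847]}},
        {"properties": {"schID": 292},
         "geometry": {"coordinates": [-122.314658339085, 47.7134302314171]}},
        {"properties": {"schID": 264},
         "geometry": {"coordinates": [-122.263172064205, 47.498863321283]}},
    ]
}

rows = []
for feature in schoolsdict_sample["features"]:
    # GeoJSON order is longitude first, then latitude
    lon, lat = feature["geometry"]["coordinates"]
    rows.append({
        "schoolid": feature["properties"]["schID"],
        "longitude": lon,
        "latitude": lat,
    })

schoolslocation = pd.DataFrame(rows, columns=["schoolid", "longitude", "latitude"])

print(schoolslocation.shape)
print(schoolslocation.head())
```

Collecting all three values in one pass avoids separate loops whose index variables can drift out of step.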

#### Step 8: combine the two data frames (using pd.concat) so that each school has a longitude and latitude value against it
- schoolslocation dataframe  +  schools dataframe
- resulting dataframe -> schoolslocinfo
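
`pd.concat` with `axis=1` pairs rows by index label, so it gives the right result only when both dataframes list the schools in the same order. The sketch below shows that case going wrong on deliberately shuffled sample data, next to a key-based `merge` as one possible alternative. The sample rows and the `schoolid` column name are assumptions taken from the earlier steps.

```python
import pandas as pd

# Sample of the schools dataframe after Step 2
schools = pd.DataFrame({
    "OBJECTID": [1, 2, 3],
    "schID": [106, 292, 264],
    "schName": ["Jane Addams", "Hazel Wolf K-8", "Rainier View"],
    "esmshs": ["MS", "ES", "ES"],
})

# Sample of the schoolslocation dataframe, deliberately in a different order
schoolslocation = pd.DataFrame({
    "schoolid": [264, 106, 292],
    "longitude": [-122.263172064205, -122.293009024934, -122.314658339085],
    "latitude": [47.498863321283, 47.709944861847, 47.7134302314171],
})

# Side-by-side concat matches on the row index, not on the school ID
by_position = pd.concat([schools, schoolslocation], axis=1)

# Key-based join matches on the school ID; validate raises if an ID repeats
schoolslocinfo = schools.merge(
    schoolslocation,
    left_on="schID",
    right_on="schoolid",
    how="left",
    validate="one_to_one",
).drop(columns="schoolid")

print(by_position[["schID", "schoolid"]])
print(schoolslocinfo)
```

If both dataframes were built from files in the same row order, `pd.concat` and `merge` should agree; comparing the two ID columns after a concat is a quick check.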