In [1]:
import pandas as pd # library for data analysis
import numpy as np
import pyproj
import folium # plotting library
from folium import plugins
from folium.plugins import HeatMap
import warnings
warnings.simplefilter("ignore")

In [2]:
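# Note: UTM zone 33 is kept so the outputs below stay reproducible;
# St. Petersburg (about 30.3° E) actually lies in UTM zone 36.
proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')

# pyproj.transform is deprecated; Transformer is its supported replacement.
to_xy = pyproj.Transformer.from_proj(proj_latlon, proj_xy, always_xy=True)
to_lonlat = pyproj.Transformer.from_proj(proj_xy, proj_latlon, always_xy=True)

def lonlat_to_xy(lon, lat):
    """Project WGS84 longitude/latitude to UTM x/y in metres."""
    return to_xy.transform(lon, lat)

def xy_to_lonlat(x, y):
    """Convert UTM x/y in metres back to WGS84 longitude/latitude."""
    return to_lonlat.transform(x, y)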
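
The projection above hardcodes UTM zone 33. As a quick check, the standard 6-degree zone for a given longitude can be computed as sketched below. `utm_zone_for_longitude` is a hypothetical helper written for this illustration, not a pyproj function, and it ignores the special-case zones around Norway and Svalbard.

```python
def utm_zone_for_longitude(lon):
    """Return the standard UTM zone number (1-60) for a longitude in degrees.

    Hypothetical helper for illustration; ignores the Norway/Svalbard exceptions.
    """
    # Zones are 6 degrees wide; zone 1 starts at 180 degrees west.
    return int((lon + 180) // 6) % 60 + 1

# Longitude of the city center used in this notebook.
spb_zone = utm_zone_for_longitude(30.316229)
print(spb_zone)
```

For this longitude the helper returns 36, whereas zone 33 spans 12° E to 18° E. Projecting far from a zone's central meridian increases distortion, so distances computed in zone 33 are only approximate here.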

In [3]:
spb_center_latitude = 59.938732
spb_center_longitude = 30.316229

In [4]:
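df = pd.read_excel('parks.xlsx')
# Rows with a venue id are locations where a park was found;
# vlat/vlng hold the coordinates of that park.
parks_latlon = df[df.id.notnull()][['vlat', 'vlng']]
lat = parks_latlon.vlat.tolist()
lng = parks_latlon.vlng.tolist()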

In [63]:
df[df.id.notnull()].head()

Unnamed: 0.1,Unnamed: 0,N,lat,lng,id,name,vlat,vlng,address,distance
11,11,12,59.836981,30.213834,516016ace4b0dcf886fe2a11,Двор,59.840003,30.210297,Россия,390.0
12,12,12,59.836981,30.213834,53ceb280498e0eaa7cd0b3bf,Сквер у реки,59.833698,30.211293,"просп. Ветеранов, Санкт-Петербург, Россия",392.0
18,18,18,59.824717,30.31708,4ff47a14e4b0e8d00c65d6f8,Сквер вдоль Пулковского шоссе,59.827217,30.322655,"Пулковское Шоссе, Санкт-Петербург, Россия",418.0
19,19,18,59.824717,30.31708,56cb024d498e18ceb1235d6c,"сад ""Дубовая Роща""",59.824318,30.321292,Россия,239.0
25,25,24,59.849572,30.174286,4c5c467294fd0f47ac24c845,Южно-Приморский парк (Парк Ленина),59.847895,30.172891,Петергофское ш. (ул. Доблести и Петергофское ш...,202.0


In [82]:
map_spb = folium.Map(location=[spb_center_latitude, spb_center_longitude], zoom_start=11)
HeatMap(list(zip(lat, lng))).add_to(map_spb)
map_spb

After visualizing the data, we can conclude that parks are scattered throughout the city, but some areas have no parks at all.

We will now focus on identifying the worst of these areas.

For this we will use the k-means clustering method.

In [5]:
# Locations with no park found nearby (no venue id)
dfbad = df[df.id.isnull()][['lat', 'lng']]
badlat = dfbad.lat.tolist()
badlng = dfbad.lng.tolist()

# Project all bad locations to planar x/y in one call.
# DataFrame.append was removed in pandas 2.0, so the frame is built directly.
x, y = lonlat_to_xy(dfbad.lng.values, dfbad.lat.values)
dfxys = pd.DataFrame({'x': x, 'y': y})
xys = dfxys[['x', 'y']].values

#### K-means clustering
Let's run k-means clustering on the projected coordinates and look at the result.

In [60]:
from sklearn.cluster import KMeans

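number_of_clusters = 23
# n_init=10 pins the older scikit-learn default so results do not depend on the installed version
kmeans = KMeans(n_clusters=number_of_clusters, n_init=10, random_state=0).fit(xys)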
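
The value of 23 clusters is chosen by hand. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (`inertia_`) across several cluster counts and look for the point where it stops dropping quickly. The sketch below is not part of the original analysis: the synthetic points stand in for `xys`, and the candidate values of k are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the projected bad locations (x, y in metres).
rng = np.random.default_rng(0)
demo_xys = rng.uniform(0, 20000, size=(300, 2))

# Fit k-means for several candidate cluster counts and record the inertia.
candidate_ks = [5, 10, 20, 30]
inertias = {}
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(demo_xys)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k:2d}  inertia={inertia:.3e}")
```

Inertia always tends to fall as k grows, so the useful signal is where the decrease flattens, not the minimum itself.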

With the bad locations clustered, we now convert each cluster center back to longitude and latitude. Each center marks a zone of bad locations; these zones, their centers, and the addresses of those centers will be the final result of our analysis.

In [61]:
cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]
cluster_centers

[(30.217030170380124, 59.91183302818294),
 (30.448955457308475, 59.9494336088092),
 (30.313365074575884, 60.048398326120406),
 (30.281247813485674, 59.87738390682817),
 (30.498413723181557, 59.881211904243116),
 (30.463266367538672, 60.020541701246934),
 (30.21155489983066, 59.84460336459657),
 (30.381898407407736, 59.975948230648605),
 (30.241579235665107, 59.99802976324803),
 (30.366844378700062, 59.83778510358248),
 (30.439349185674647, 59.9130796847425),
 (30.439987959369255, 59.84961507004598),
 (30.365890318234598, 59.899773608284434),
 (30.178607587921825, 59.87389672308785),
 (30.214142845982153, 59.955730958159776),
 (30.531009212351957, 59.95835841959436),
 (30.521677490117206, 59.91985601471596),
 (30.50784165764661, 59.99484983648163),
 (30.401887949008255, 60.03811899504618),
 (30.254516936896344, 60.042958566325275),
 (30.32283579636662, 60.01054380579133),
 (30.281965834364232, 59.83335268375879),
 (30.458327775742784, 59.986442260399656)]

The clusters group most of the locations without parks, and the cluster centers sit in the middle of the zones where such locations are densest.
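
To check how well the centers represent their zones, one can count the points assigned to each center and measure how far those points lie from it. The sketch below does this with NumPy on small made-up arrays. `cluster_summary` is a hypothetical helper; in the notebook, `xys` and `kmeans.cluster_centers_` would take the place of the demo inputs.

```python
import numpy as np

def cluster_summary(points, centers):
    """Assign each point to its nearest center; return sizes and mean distances."""
    # Pairwise Euclidean distances, shape (n_points, n_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sizes = np.bincount(labels, minlength=len(centers))
    nearest = dists[np.arange(len(points)), labels]
    # Sum of distances per cluster; empty clusters get NaN as their mean.
    sums = np.bincount(labels, weights=nearest, minlength=len(centers))
    mean_dist = np.where(sizes > 0, sums / np.maximum(sizes, 1), np.nan)
    return sizes, mean_dist

# Made-up planar coordinates for illustration only.
demo_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
demo_centers = np.array([[0.0, 1.0], [10.0, 10.0]])
sizes, mean_dist = cluster_summary(demo_points, demo_centers)
print(sizes, mean_dist)
```

A center with few points or a large mean distance is a weaker candidate for a problem zone than one surrounded by many nearby points.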

The addresses of those cluster centers are a good starting point for exploring each neighborhood and choosing the best location for a park based on local specifics.

Let's see those zones on a city map, on top of a heatmap of the locations without parks:

In [62]:
map_no_park = folium.Map(location=[spb_center_latitude, spb_center_longitude], zoom_start=11)
HeatMap(list(zip(badlat, badlng))).add_to(map_no_park)
# Separate loop names avoid overwriting the `lat` list defined earlier
for center_lon, center_lat in cluster_centers:
    folium.Circle([center_lat, center_lon], radius=500, color='red', fill=True, fill_opacity=0.25).add_to(map_no_park)
map_no_park

### Results and discussion
This analysis shows that the city center is better supplied with parks than the outskirts. New districts on the outskirts are under active construction, yet little attention is paid to landscaping, and parks are often sacrificed to developers' interests.

We identified the centers of problem areas that have no parks nearby. The result follows directly from the data, but a real project would have to take many other parameters into account, for example the population of each area and the size of its existing parks.
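
One way to bring population into the clustering is to weight each location by the number of residents around it, so that centers are pulled toward densely populated areas. This is a minimal sketch of that idea, not something the analysis above does: the `population` array and the demo coordinates are invented for illustration, and the dataset used here contains no population column.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up planar coordinates (metres) of locations without parks.
demo_xys = np.array([[0.0, 0.0], [100.0, 0.0], [5000.0, 5000.0], [5100.0, 5000.0]])
# Hypothetical number of residents near each location; not in the original data.
population = np.array([100.0, 900.0, 500.0, 500.0])

weighted = KMeans(n_clusters=2, n_init=10, random_state=0)
weighted.fit(demo_xys, sample_weight=population)

# Each center is the population-weighted mean of the points in its cluster.
print(weighted.cluster_centers_)
```

In the first demo cluster the center lands much closer to the location with 900 residents than to the one with 100, which is the intended effect of the weights.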

### Conclusion
This project can be reused for other cities; just adjust the number of clusters to suit the city.

It is far from perfect: much work remains, and other data sources could be added. Still, the result seems to correlate with the real world. To someone who knows the city, the predicted areas look correct.