# Neighborhood Analysis of a German City for Young Families
We'll investigate the neighborhoods of a German city, including their venues, and try to find the best possible spot for families with children to move to. As a measure, we use the proximity to venues that matter for children, such as schools, playgrounds and medical care.

## Description of the problem
Young families with children, or plans for some, frequently have to find a new place to call home, one whose neighborhood suits their changing life as a family.
Identifying such a place is not easy, as multiple factors play a role and not all information is readily available. To support the decision, we want to provide a geographical analysis of child-friendly neighborhoods based on the distance to desired venues. For example, the new home should be near a school but also offer a playground for leisure time.
We'll focus on the German city of Wuppertal out of curiosity.

## Description of the data
As our dataset, we use a publicly available table for the German city of Wuppertal, including its districts and several population metrics.

https://de.wikipedia.org/wiki/Liste_der_Stadtbezirke_und_Stadtteile_von_Wuppertal

The original table looks like this:
<img src="pic1.png">

During preparation we'll translate and transform the data. The details follow below; the relevant columns are:
- 'Neighborhood'
- 'Borough'
- 'Residents'
- 'Size'
- 'Population_Density'
- 'Foreigner_Percentage'
- 'Unemployment_Rate'
- 'Livinghouses'
- 'Flats_thereof'
- 'Schools(Elementary_Schools)'
- 'Private_Cars'

Additionally, we'll use the publicly available GeoJSON data for these districts, which includes their geographic boundaries.
URL: http://daten.wuppertal.de/Infrastruktur_Bauen_Wohnen/Quartiere_EPSG4326_JSON.json

<img src="pic2.png">

Furthermore, we'll connect to the Foursquare database and use its venue data.

## Methodology section

For the analysis we'll use several fairly standard Python packages.

In [1]:
# Standards
import pandas as pd
import numpy as np

# Online interfaces for data fetching
import requests
from bs4 import BeautifulSoup

# Machine learning
from sklearn.cluster import KMeans

# Data visualization
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

# Geodata manipulation
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geojson
from shapely.geometry import shape

In [2]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

### Fetch and initially clean data from Wikipedia

In [3]:
d_wiki = pd.read_html("https://de.wikipedia.org/wiki/Liste_der_Stadtbezirke_und_Stadtteile_von_Wuppertal#Die_Wohnquartiere_Wuppertals_(Stand:_31._Dezember_2007)", decimal=',', thousands='.')[2]

We won't need some of the columns, so we drop them.

In [4]:
print(d_wiki.columns)
d_wiki.drop(["Karte[4]", "Nr.", "Kommunale Zuordnung vor der Eingemeindung"], axis=1, inplace=True)

Index(['Karte[4]', 'Nr.', 'Statistisches Wohnquartier', 'Stadtbezirk',
       'Kommunale Zuordnung vor der Eingemeindung', 'Einwohner-zahl',
       'Fläche( km² )', 'Bevölkerungs-dichte(Einw. / km² )',
       'Ausländer-anteil (in %)', 'Arbeitslosen-quote (in %)', 'Wohn-gebäude',
       'darin Wohnungen', 'Schulen(davonGrundschulen)', 'PrivateKFZ'],
      dtype='object')


So that everyone can follow along, we translate the column names from German to English.

In [5]:
d_wiki.rename(columns={"Statistisches Wohnquartier": "Neighborhood", "Stadtbezirk": "Borough", "Einwohner-zahl": "Residents", d_wiki.columns[3]: "Size", d_wiki.columns[4]: "Population_Density", d_wiki.columns[5]: "Foreigner_Percentage", d_wiki.columns[6]: "Unemployment_Rate", "Wohn-gebäude": "Livinghouses", d_wiki.columns[8]: "Flats_thereof", "Schulen(davonGrundschulen)": "Schools(Elementary_Schools)", "PrivateKFZ": "Private_Cars"}, inplace=True)
d_wiki.columns

Index(['Neighborhood', 'Borough', 'Residents', 'Size', 'Population_Density',
       'Foreigner_Percentage', 'Unemployment_Rate', 'Livinghouses',
       'Flats_thereof', 'Schools(Elementary_Schools)', 'Private_Cars'],
      dtype='object')

The schools column actually holds two values: the total number of schools and, in parentheses, the number of elementary schools among them. For simplicity, we split it into two columns by extracting the elementary schools with a regular expression.

In [6]:
d_wiki["Elementary_Schools"] = d_wiki["Schools(Elementary_Schools)"].str.extract(r"\((.)\)")
d_wiki["Elementary_Schools"] = d_wiki["Elementary_Schools"].astype(str).str.replace("-", "0").astype(int)
d_wiki["Schools(Elementary_Schools)"] = d_wiki["Schools(Elementary_Schools)"].str.extract(r"(.*)(?=\()")
d_wiki.rename(columns={"Schools(Elementary_Schools)": "Schools"}, inplace=True)
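
To make the split concrete, here is a minimal sketch on a few hand-written cell values. The sample strings are an assumption about how the raw column looks (total first, elementary schools in parentheses, "-" meaning none); they are not taken from the table.

```python
import pandas as pd

# Hypothetical raw cells: "<total> (<elementary>)", with "-" standing for zero.
raw = pd.Series(["8 (3)", "2 (-)", "- (-)"], name="Schools(Elementary_Schools)")

# The single character inside the parentheses is the elementary school count.
elementary = raw.str.extract(r"\((.)\)")[0].replace("-", "0").astype(int)

# Everything before the opening parenthesis is the total count.
total = raw.str.extract(r"(.*)(?=\()")[0].str.strip().replace("-", "0").astype(int)

split = pd.DataFrame({"Schools": total, "Elementary_Schools": elementary})
print(split)
```

Note that the pattern `\((.)\)` captures exactly one character, so a neighborhood with ten or more elementary schools would not match; a pattern such as `\((\d+|-)\)` would cover that case.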

After the split, the schools column is still of type object because missing values are marked with "-". We replace those with 0 and convert the column to integers.

In [7]:
d_wiki["Schools"] = d_wiki["Schools"].replace("- ","0").astype(int)

The percentage columns were parsed without their decimal point, so we rescale them.

In [8]:
d_wiki["Foreigner_Percentage"] = d_wiki["Foreigner_Percentage"]/10
d_wiki["Unemployment_Rate"] = d_wiki["Unemployment_Rate"]/100

The final table of neighborhoods looks like this:

In [9]:
d_wiki

Unnamed: 0,Neighborhood,Borough,Residents,Size,Population_Density,Foreigner_Percentage,Unemployment_Rate,Livinghouses,Flats_thereof,Schools,Private_Cars,Elementary_Schools
0,Elberfeld-Mitte,Elberfeld,5780,1.08,5352,2.51,0.0913,651,3718,2,1764,0
1,Nordstadt,Elberfeld,17269,1.18,14635,2.77,0.0903,1637,10675,8,4926,3
2,Ostersbaum,Elberfeld,14919,1.38,10811,2.46,0.0967,1416,8807,4,4877,3
3,Südstadt,Elberfeld,9640,0.59,16339,1.85,0.0766,771,6048,1,2977,1
4,Grifflenberg,Elberfeld,11696,4.45,2628,1.01,0.0321,1557,6289,1,5181,1
5,Friedrichsberg,Elberfeld,6449,2.39,2698,1.44,0.0678,654,3591,2,2396,2
6,Sonnborn,Elberfeld-West,4008,2.39,3929,1.33,0.0594,545,2360,1,1722,1
7,Varresbeck,Elberfeld-West,4376,2.59,1690,1.45,0.0375,804,2228,0,2215,0
8,Nützenberg,Elberfeld-West,5590,1.48,3777,1.75,0.0615,1034,3295,4,2491,3
9,Brill,Elberfeld-West,4414,1.22,3618,0.7,0.0283,693,2773,1,2470,0


### Fetch Geojson data for district boundaries

In [10]:
url = r'http://daten.wuppertal.de/Infrastruktur_Bauen_Wohnen/Quartiere_EPSG4326_JSON.json' # geojson file
geojson = requests.get(url).json()

Let's check whether the districts are named the same way as in the Wikipedia data:

In [11]:
geojson_df = pd.read_json("http://daten.wuppertal.de/Infrastruktur_Bauen_Wohnen/Quartiere_EPSG4326_JSON.json")
geojson_districts = []
for i in geojson_df["features"]:
    geojson_districts.append(i["properties"]["NAME"])
geojson_districts.sort()

In [12]:
data_pre = {"wiki_neighborhoods": d_wiki["Neighborhood"].sort_values(), "GEOJSON_neighborhoods": geojson_districts}
comparison = pd.DataFrame(data_pre).reset_index(drop=True)
comparison.loc[comparison["wiki_neighborhoods"]==comparison["GEOJSON_neighborhoods"], "HIT"] = True
comparison.loc[comparison["wiki_neighborhoods"]!=comparison["GEOJSON_neighborhoods"], "HIT"] = False

In [13]:
comparison[comparison["HIT"]==False]

Unnamed: 0,wiki_neighborhoods,GEOJSON_neighborhoods,HIT
10,Cronenberg-Mitte,Cronenberg,False
15,Elberfeld-Mitte,Elberfeld,False
18,Friedrich-Engels-Allee,Fr.-Engels-Allee,False
30,Industriestraße,Industriestr.,False
31,Jesinghauser Straße,Jesinghauser Str.,False
40,Nevigeser Straße,Nevigeser Str,False
52,Schenkstraße,Schenkstr.,False


There are a few names we have to harmonize to make both datasets comparable. We'll go with the GeoJSON convention.

In [14]:
d_wiki = d_wiki.merge(comparison, left_on='Neighborhood', right_on="wiki_neighborhoods", how="left")
d_wiki.loc[d_wiki["HIT"]==False, 'Neighborhood'] = d_wiki.loc[d_wiki["HIT"]==False, 'GEOJSON_neighborhoods']
d_wiki.drop(["wiki_neighborhoods","GEOJSON_neighborhoods", "HIT"], inplace=True, axis=1)
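
The sorted side-by-side comparison above only works because both lists have the same length and line up after sorting. As a less order-dependent alternative, the seven mismatches could be written down as an explicit mapping. The sketch below uses the name pairs from the mismatch table; the toy frame and the name `wiki_to_geojson` are made up for illustration.

```python
import pandas as pd

# Name pairs copied from the mismatch table above (Wikipedia -> GeoJSON).
wiki_to_geojson = {
    "Cronenberg-Mitte": "Cronenberg",
    "Elberfeld-Mitte": "Elberfeld",
    "Friedrich-Engels-Allee": "Fr.-Engels-Allee",
    "Industriestraße": "Industriestr.",
    "Jesinghauser Straße": "Jesinghauser Str.",
    "Nevigeser Straße": "Nevigeser Str",
    "Schenkstraße": "Schenkstr.",
}

# Toy frame standing in for d_wiki; only the name column matters here.
toy = pd.DataFrame({"Neighborhood": ["Nordstadt", "Schenkstraße", "Elberfeld-Mitte"]})

# Names without an entry in the mapping are left unchanged.
toy["Neighborhood"] = toy["Neighborhood"].replace(wiki_to_geojson)
print(toy)
```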

### Fetch Foursquare data for Wuppertal

We'll need to set some parameters for the Foursquare API.

In [15]:
CLIENT_ID = "YOUR_FOURSQUARE_CLIENT_ID"  # placeholder: insert your own Foursquare credentials
CLIENT_SECRET = "YOUR_FOURSQUARE_CLIENT_SECRET"  # placeholder: never publish real secrets
version = "20200110"
base_url = "https://api.foursquare.com/v2/venues/search?"
LIMIT = 100
radius = 5000

Fetch the district centers:

In [16]:
collection = requests.get('http://daten.wuppertal.de/Infrastruktur_Bauen_Wohnen/Quartiere_EPSG4326_JSON.json').json()
features = collection["features"]
centers = {}
for feature in features:
    s = shape(feature["geometry"]).centroid
    centers[feature["properties"]["NAME"]] = (s.x, s.y)

In [17]:
centers2 = pd.DataFrame(centers).transpose().reset_index()
centers2.rename(columns={"index": "Neighborhood", 1: "Latitude", 0: "Longitude"}, inplace=True)
d_wiki = pd.merge(d_wiki, centers2, on="Neighborhood", how="left")
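
GeoJSON stores positions as [longitude, latitude], which is why column 0 becomes "Longitude" and column 1 "Latitude" above. As a rough illustration of what a polygon centroid is, here is a pure-Python sketch of the area-weighted centroid of a single ring. The helper `ring_centroid` and the rectangle are hypothetical, and the coordinates are treated as planar, as shapely does with raw coordinates.

```python
def ring_centroid(ring):
    """Area-weighted centroid of a closed ring given as (lon, lat) pairs."""
    area2 = 0.0  # twice the signed area of the ring
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # The centroid formula divides by 6 * area, i.e. 3 * area2.
    return cx / (3 * area2), cy / (3 * area2)


# Made-up rectangle; first and last point coincide to close the ring.
ring = [(7.0, 51.0), (7.2, 51.0), (7.2, 51.1), (7.0, 51.1), (7.0, 51.0)]
lon, lat = ring_centroid(ring)
print(round(lon, 6), round(lat, 6))
```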

We reuse the function from the course lab to fetch venues for the given neighborhoods.

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            version, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    try:
        nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
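    except ValueError:
        # raised when the frame does not have the expected seven columns, e.g. when no venues came back
        print("Error encountered: unexpected number of columns")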
    
    return(nearby_venues)
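
As a side note on the request URL: instead of formatting the query string by hand, the parameters can be collected in a dictionary and encoded with the standard library. This is only a sketch; `build_explore_url` is a hypothetical helper, the credentials are placeholders, and only the endpoint and parameter names are taken from the function above.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20200110", radius=500, limit=100):
    """Assemble the explore URL used above from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the coordinates are the Elberfeld center from the data.
url = build_explore_url(51.255741, 7.146831, "MY_ID", "MY_SECRET")
print(url)
```

Keeping `radius` as a named parameter also makes it visible that `getNearbyVenues` falls back to its own default of 500 unless a radius is passed explicitly; the module-level `radius = 5000` set earlier is not used by it.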

In [19]:
#wuppertal_venues = getNearbyVenues(names=d_wiki["Neighborhood"], latitudes=d_wiki["Latitude"], longitudes=d_wiki["Longitude"])

In [20]:
#wuppertal_venues.to_csv("wuppertal_venues.csv")
wuppertal_venues = pd.read_csv("wuppertal_venues.csv")

In [21]:
category_count = wuppertal_venues[["Neighborhood", "Venue Category", "Venue"]].groupby(["Neighborhood", "Venue Category"], as_index=False).count()

In [22]:
category_count["Venue Category"].value_counts()

Supermarket                   14
Bus Stop                      12
Café                          11
Construction & Landscaping    10
Cable Car                      8
Bakery                         8
Plaza                          8
Hotel                          6
Fast Food Restaurant           5
Clothing Store                 5
German Restaurant              4
Electronics Store              4
Restaurant                     4
Pizza Place                    4
Gas Station                    4
Drugstore                      4
Intersection                   4
Business Service               3
BBQ Joint                      3
Train Station                  3
Chinese Restaurant             3
Gym                            3
Furniture / Home Store         3
Liquor Store                   3
Ice Cream Shop                 3
Bar                            2
Asian Restaurant               2
Playground                     2
Steakhouse                     2
Greek Restaurant               2
Gastropub 

In [23]:
kids_stuff = ["Ice Cream Shop", "Park", "Playground", "Soccer Field", "Rest Area", "Beach", "Sculpture Garden", "Botanical Garden", "Zoo Exhibit", "Rock Climbing Spot", "Forest", "Nature Preserve", "Lake", "Zoo", "Pet Store", "Garden"]
anti_kids_stuff = ["Intersection", "Liquor Store", "Gastropub", "Cocktail Bar", "Hookah Bar", "Bar", "Beer Bar", "Nightclub", "Light Rail Station", "Train Station", "Pub", "Lottery Retailer", "Bridge", "Hostel", "Smoke Shop"]

In [24]:
category_count.loc[category_count["Venue Category"].isin(kids_stuff), "Kids_friendly"] = 1
category_count.loc[category_count["Venue Category"].isin(anti_kids_stuff), "Kids_unfriendly"] = 1

In [25]:
category_count2 = category_count[["Neighborhood", "Kids_friendly", "Kids_unfriendly"]].groupby("Neighborhood").sum()
category_count2["Total"] = category_count2["Kids_friendly"] - category_count2["Kids_unfriendly"]
category_count2 = category_count2.sort_values(by="Total", ascending=False).reset_index(drop=False)
category_count2

Unnamed: 0,Neighborhood,Kids_friendly,Kids_unfriendly,Total
0,Zoo,2.0,0.0,2.0
1,Ostersbaum,2.0,0.0,2.0
2,Siebeneick,1.0,0.0,1.0
3,Kohlfurth,1.0,0.0,1.0
4,Höhe,1.0,0.0,1.0
5,Hilgershöhe,1.0,0.0,1.0
6,Hesselnberg,1.0,0.0,1.0
7,Nächstebreck-West,1.0,0.0,1.0
8,Rauental,1.0,0.0,1.0
9,Barmen-Mitte,1.0,0.0,1.0
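
The scoring above can be followed on a toy example. The sketch below mirrors the same idea under simplifying assumptions: the neighborhoods, venues and shortened category lists are invented, and each category counts once per neighborhood, as in the grouped count above.

```python
import pandas as pd

# Invented venues for two made-up neighborhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Playground", "Playground", "Bar", "Park", "Zoo"],
})
kids_stuff = ["Park", "Playground", "Zoo"]
anti_kids_stuff = ["Bar"]

# One row per (neighborhood, category), so duplicates count only once.
cats = venues.drop_duplicates().copy()
cats["Kids_friendly"] = cats["Venue Category"].isin(kids_stuff).astype(int)
cats["Kids_unfriendly"] = cats["Venue Category"].isin(anti_kids_stuff).astype(int)

score = cats.groupby("Neighborhood")[["Kids_friendly", "Kids_unfriendly"]].sum()
score["Total"] = score["Kids_friendly"] - score["Kids_unfriendly"]
print(score)
```

Neighborhood A ends up with a total of 0 (one friendly and one unfriendly category), B with 2.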


### Initial analysis of the neighborhoods

First let's look into the Wikipedia data on the neighborhoods by aggregating on the boroughs to get a feeling for their characteristics.

In [26]:
d_wiki[["Borough", "Neighborhood", "Size", "Residents", "Foreigner_Percentage", "Unemployment_Rate", 'Livinghouses',
       'Flats_thereof', 'Schools', 'Private_Cars', 'Elementary_Schools']].groupby('Borough').agg({'Neighborhood': "count", 
                         'Size':'sum', 
                         'Residents':'sum', 
                         'Foreigner_Percentage': "mean",
                         "Unemployment_Rate": "mean",
                         "Livinghouses": "sum",
                         "Flats_thereof": "sum",
                         "Schools": "sum",
                         "Elementary_Schools": "sum",
                         "Private_Cars": "sum"
                    }).sort_values(by="Residents", ascending=False)

Borough,Neighborhood,Size,Residents,Foreigner_Percentage,Unemployment_Rate,Livinghouses,Flats_thereof,Schools,Elementary_Schools,Private_Cars
Elberfeld,6,11.07,65753,2.006667,0.0758,6686,39128,18,10,22121
Barmen,10,15.44,59410,1.466,0.05977,7403,34015,21,9,24162
Oberbarmen,5,12.57,42910,1.46,0.06664,5511,22830,13,8,17372
Uellendahl-Katernberg,7,25.91,38192,0.524286,0.025857,7713,19619,9,7,21174
Vohwinkel,9,20.42,31578,0.92,0.037956,5328,15732,12,6,14652
Elberfeld-West,7,11.74,27774,1.372857,0.0501,4243,15895,8,6,12784
Langerfeld-Beyenburg,9,29.4,25517,0.955556,0.058256,4560,13447,7,5,12552
Cronenberg,7,21.5,21846,0.49,0.025729,4996,11102,7,4,12881
Ronsdorf,6,16.05,21776,0.588333,0.035267,4141,11296,6,5,11819
Heckinghausen,3,5.66,21261,1.163333,0.054267,2555,12150,4,3,9014


Then let's briefly visualize the neighborhoods on a map by drawing a choropleth map based on the number of residents.

In [27]:
# create a numpy array of length 6 with linear spacing from the minimum to the maximum number of residents
threshold_scale = np.linspace(d_wiki['Residents'].min(),
                              d_wiki['Residents'].max(),
                              6, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1 # make sure that the last value of the list is greater than the maximum number of residents

# draw the choropleth map with the thresholds computed above
world_map = folium.Map(location=[51.256214, 7.150764], zoom_start=11)
world_map.choropleth(
    geo_data=geojson,
    data=d_wiki,
    columns=['Neighborhood', 'Residents'],
    key_on='feature.properties.NAME',
    threshold_scale=threshold_scale,
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Residents',
    reset=True
)
world_map
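
To see what the threshold computation produces, here is a small sketch with made-up resident counts (100 to 1100); only the `np.linspace` call pattern is taken from the cell above.

```python
import numpy as np

# Made-up resident counts; only the minimum and maximum matter here.
residents = np.array([100, 350, 720, 1100])

# Six evenly spaced edges from minimum to maximum, i.e. five color classes.
threshold_scale = np.linspace(residents.min(), residents.max(), 6, dtype=int).tolist()

# Push the last edge just above the maximum so the largest value falls inside the last class.
threshold_scale[-1] = threshold_scale[-1] + 1
print(threshold_scale)
```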



## k-means clustering of neighborhoods by venues

First we'll reduce the number of categories and aggregate them on a higher level.

In [28]:
wuppertal_venues["Venue Category"].value_counts()

Supermarket                   18
Café                          17
Bus Stop                      14
Bakery                        13
Construction & Landscaping    11
Clothing Store                10
Hotel                          9
Cable Car                      9
Plaza                          8
Drugstore                      8
Fast Food Restaurant           6
Restaurant                     6
Pizza Place                    5
Platform                       4
German Restaurant              4
Intersection                   4
Gas Station                    4
Electronics Store              4
Ice Cream Shop                 4
Italian Restaurant             4
Business Service               3
Park                           3
Cocktail Bar                   3
BBQ Joint                      3
Liquor Store                   3
Chinese Restaurant             3
Theater                        3
Furniture / Home Store         3
Coffee Shop                    3
Gym                            3
Train Stat

In [29]:
shops = ["Supermarket", "Drugstore", "Bakery", "Clothing Store", "Coffee Shop", "Ice Cream Shop", "Grocery Store", "Big Box Store", "Electronics Store", "Sporting Goods Shop", "Pharmacy", "Shopping Mall", "Convenience Store", "Organic Grocery", "Bookstore", "Paper / Office Supplies Store", "Hardware Store", "Stationery Store", "Automotive Shop", "Discount Store", "Hobby Shop", "Furniture / Home Store", "Farmers Market", "Miscellaneous Shop", "Shoe Store", "Flea Market", "Mobile Phone Shop", "Department Store", "Camera Store", "Photography Studio", "Tailor Shop", "Fruit & Vegetable Store"]
kids_unfriendly = ["Intersection", "Liquor Store", "Gastropub", "Cocktail Bar", "Hookah Bar", "Bar", "Beer Bar", "Nightclub", "Light Rail Station", "Train Station", "Pub", "Lottery Retailer", "Bridge", "Hostel", "Smoke Shop"]
imbiss = ["Café", "Fast Food Restaurant", "Pizza Place", "Diner", "Sandwich Place", "Burger Joint", "Snack Place", "Bistro", "Doner Restaurant", "Fried Chicken Joint", "Food & Drink Shop"]
restaurant = ["Japanese Restaurant", "Scandinavian Restaurant", "American Restaurant", "Steakhouse", "Asian Restaurant", "Falafel Restaurant", "Modern European Restaurant", "Mexican Restaurant", "Korean Restaurant", "Grilled Meat Restaurant", "Turkish Restaurant", "Spanish Restaurant", "Chinese Restaurant", "Greek Restaurant", "BBQ Joint", "German Restaurant", "Restaurant", "Italian Restaurant"]
activities = ["Park", "Plaza", "Pool", "Trail", "Soccer Field", "Golf Course", "Botanical Garden", "Sculpture Garden", "Climbing Gym", "Zoo", "Zoo Exhibit", "Water Park", "Theater", "Sports Club", "Lake"]
hotels = ["Hotel", "Hostel"]

wuppertal_venues.loc[wuppertal_venues["Venue Category"].isin(shops), "Category"] = "Rest"
wuppertal_venues.loc[wuppertal_venues["Venue Category"].isin(kids_unfriendly), "Category"] = "Unfriendly"
wuppertal_venues.loc[wuppertal_venues["Venue Category"].isin(imbiss), "Category"] = "Rest"
wuppertal_venues.loc[wuppertal_venues["Venue Category"].isin(restaurant), "Category"] = "Rest"
wuppertal_venues.loc[wuppertal_venues["Venue Category"].isin(activities), "Category"] = "Activity"
wuppertal_venues.loc[wuppertal_venues["Venue Category"].isin(hotels), "Category"] = "Rest"
wuppertal_venues.loc[wuppertal_venues["Category"].isna(), "Category"] = "Rest"  # everything not matched above
wuppertal_venues

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Category
0,0,Elberfeld,51.255741,7.146831,Peek & Cloppenburg,51.257811,7.145993,Clothing Store,Rest
1,1,Elberfeld,51.255741,7.146831,Milia's Coffee,51.256723,7.14715,Café,Rest
2,2,Elberfeld,51.255741,7.146831,Café Venezia,51.257475,7.146048,Café,Rest
3,3,Elberfeld,51.255741,7.146831,Mazzino,51.255883,7.146583,Coffee Shop,Rest
4,4,Elberfeld,51.255741,7.146831,Food Brother,51.257393,7.143594,Burger Joint,Rest
5,5,Elberfeld,51.255741,7.146831,mangimangi,51.259548,7.147168,Restaurant,Rest
6,6,Elberfeld,51.255741,7.146831,Historische Stadthalle,51.253032,7.142857,Town Hall,Rest
7,7,Elberfeld,51.255741,7.146831,Cafe & Bar Celona,51.256893,7.142629,Bar,Unfriendly
8,8,Elberfeld,51.255741,7.146831,Vapiano,51.2553,7.143174,Italian Restaurant,Rest
9,9,Elberfeld,51.255741,7.146831,Starbucks,51.25878,7.146749,Coffee Shop,Rest
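
The same aggregation can also be expressed as a single lookup table, which may be easier to maintain than a chain of `.loc` assignments. This is a sketch with shortened, partly invented lists; `groups` and `category_of` are hypothetical names.

```python
import pandas as pd

# Shortened group definitions; anything not listed falls back to "Rest".
groups = {
    "Activity": ["Park", "Plaza", "Theater"],
    "Unfriendly": ["Bar", "Liquor Store"],
}

# Invert the groups into a flat {venue category: group} lookup.
category_of = {cat: group for group, cats in groups.items() for cat in cats}

toy = pd.DataFrame({"Venue Category": ["Park", "Bar", "Café", "Town Hall"]})
toy["Category"] = toy["Venue Category"].map(category_of).fillna("Rest")
print(toy)
```

One thing to keep in mind with either approach: a category that appears in two lists is assigned by whichever rule runs last. In the cell above, "Hostel" is listed under both `kids_unfriendly` and `hotels`, so it ends up as "Rest".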


In [30]:
# one hot encoding
wuppertal_onehot = pd.get_dummies(wuppertal_venues['Category'], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
wuppertal_onehot['Neighborhood'] = wuppertal_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [wuppertal_onehot.columns[-1]] + list(wuppertal_onehot.columns[:-1])
wuppertal_onehot = wuppertal_onehot[fixed_columns]
wuppertal_onehot.drop("Rest", axis=1, inplace=True)

wuppertal_onehot.head()

Unnamed: 0,Neighborhood,Activity,Unfriendly
0,Elberfeld,0,0
1,Elberfeld,0,0
2,Elberfeld,0,0
3,Elberfeld,0,0
4,Elberfeld,0,0


In [31]:
wuppertal_grouped = wuppertal_onehot.groupby('Neighborhood').mean().reset_index()
wuppertal_grouped

Unnamed: 0,Neighborhood,Activity,Unfriendly
0,Arrenberg,0.0,0.181818
1,Barmen-Mitte,0.052632,0.0
2,Berghausen,0.0,0.0
3,Beyenburg-Mitte,0.0,0.0
4,Blombach-Lohsiepen,0.0,0.0
5,Blutfinke,1.0,0.0
6,Brill,0.0,0.0
7,Buchenhofen,0.0,0.0
8,Clausen,0.0,0.0
9,Cronenberg,0.25,0.125


In [124]:
wuppertal_grouped2 = pd.merge(d_wiki[["Neighborhood", "Unemployment_Rate", "Schools"]], wuppertal_grouped, on="Neighborhood", how="left")

## Adding livinghouses / population as marker for wealth

In [125]:
wealth_df = d_wiki[["Neighborhood", "Residents", "Livinghouses", "Population_Density"]].copy()
wealth_df["Houses_per_resident"] = wealth_df["Livinghouses"] / wealth_df["Residents"]
wealth_df.drop(["Residents", "Livinghouses"], axis=1, inplace=True)
wealth_df.sort_values(by="Houses_per_resident", ascending=False)

Unnamed: 0,Neighborhood,Population_Density,Houses_per_resident
62,Herbringhausen,135,0.368881
12,Buchenhofen,75,0.318182
35,Kohlfurth,489,0.317518
19,Siebeneick,467,0.311149
25,Industriestr.,280,0.306122
17,Beek,1700,0.302185
24,Lüntenbeck,843,0.296496
30,Küllenhahn,381,0.280607
34,Sudberg,645,0.266878
61,Beyenburg-Mitte,1231,0.257789


In [126]:
wuppertal_grouped3 = wuppertal_grouped2.merge(wealth_df, on="Neighborhood", how="left").fillna(0)

### Running the clustering algorithm

In [169]:
# set number of clusters
kclusters = 10

wuppertal_grouped_clustering = wuppertal_grouped3.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(wuppertal_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([8, 5, 9, 5, 0, 0, 6, 7, 6, 6], dtype=int32)
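
A caveat worth noting: k-means works with Euclidean distances, so a column with values in the thousands (Population_Density) outweighs columns with values below one (Unemployment_Rate, Activity). The clustering above was run on the unscaled features. One common option, shown here only as a sketch and not as what this notebook did, is to standardize each column first. The three rows are taken from the neighborhood table above.

```python
import pandas as pd

# Three rows from the neighborhood table (Elberfeld-Mitte, Nordstadt, Grifflenberg).
features = pd.DataFrame({
    "Unemployment_Rate": [0.0913, 0.0903, 0.0321],
    "Population_Density": [5352, 14635, 2628],
})

# z-score: subtract the column mean and divide by the column standard deviation.
z = (features - features.mean()) / features.std(ddof=0)
print(z.round(3))
```

After this step every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale; the standardized frame could then be passed to `KMeans(...).fit(...)` in place of the raw one.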

### Merging and visualization
First let's merge the relevant data into one frame and look at the means per cluster.

In [170]:
wuppertal_merged = pd.merge(d_wiki[["Neighborhood", "Latitude", "Longitude"]], wuppertal_grouped3, on="Neighborhood", how="left")

# wealth_df is already part of wuppertal_grouped3, so no further merge is needed

# add clustering labels
wuppertal_merged.insert(0, 'Cluster Labels', kmeans.labels_)

wuppertal_merged.sort_values(by="Cluster Labels").head() # the cluster labels are in the first column

Unnamed: 0,Cluster Labels,Neighborhood,Latitude,Longitude,Unemployment_Rate,Schools,Activity,Unfriendly,Population_Density,Houses_per_resident
50,0,Nächstebreck-West,51.297355,7.218246,0.0383,2,0.0,0.0,2527,0.173648
63,0,Ronsdorf-Mitte/Nord,51.233069,7.188286,0.0269,1,0.0,0.0,2165,0.181222
39,0,Clausen,51.273361,7.170245,0.0451,1,0.0,0.0,2485,0.16454
14,0,Uellendahl-Ost,51.285533,7.164066,0.0431,2,0.0,0.0,2686,0.183972
4,0,Grifflenberg,51.239131,7.157117,0.0321,1,0.0,0.5,2628,0.133122


In [171]:
df_merged = wuppertal_merged.drop(["Neighborhood", "Latitude", "Longitude"], axis=1).groupby("Cluster Labels").mean()
df_merged

Cluster Labels,Unemployment_Rate,Schools,Activity,Unfriendly,Population_Density,Houses_per_resident
0,0.0477,1.5,0.05,0.1,2632.2,0.153769
1,0.0822,5.0,0.154762,0.083333,12795.5,0.107729
2,0.0914,2.5,0.088816,0.0625,8736.5,0.087302
3,0.025531,0.625,0.020833,0.083333,482.5,0.267042
4,0.073488,2.5,0.020833,0.099811,6637.25,0.115429
5,0.08345,4.5,0.0,0.2,15487.0,0.087387
6,0.050217,1.333333,0.155556,0.111111,3723.5,0.165549
7,0.035959,0.882353,0.102941,0.017157,1666.294118,0.195487
8,0.0682,1.8,0.364935,0.009091,4825.6,0.116707
9,0.0967,4.0,0.6,0.0,10811.0,0.094913


In [172]:
# create map
map_clusters = folium.Map(location=[51.256214, 7.150764], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(wuppertal_merged['Latitude'], wuppertal_merged['Longitude'], wuppertal_merged['Neighborhood'], wuppertal_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results section

For interpretation of our results, we'll rank the clusters in each category and then reflect the total score by using a weighted sum on these ranks.

In [174]:
df_merged

Cluster Labels,Unemployment_Rate,Schools,Activity,Unfriendly,Population_Density,Houses_per_resident
0,0.0477,1.5,0.05,0.1,2632.2,0.153769
1,0.0822,5.0,0.154762,0.083333,12795.5,0.107729
2,0.0914,2.5,0.088816,0.0625,8736.5,0.087302
3,0.025531,0.625,0.020833,0.083333,482.5,0.267042
4,0.073488,2.5,0.020833,0.099811,6637.25,0.115429
5,0.08345,4.5,0.0,0.2,15487.0,0.087387
6,0.050217,1.333333,0.155556,0.111111,3723.5,0.165549
7,0.035959,0.882353,0.102941,0.017157,1666.294118,0.195487
8,0.0682,1.8,0.364935,0.009091,4825.6,0.116707
9,0.0967,4.0,0.6,0.0,10811.0,0.094913


In [181]:
df_ranked = df_merged.rank()

In [182]:
df_ranked["Schools"] = df_ranked["Schools"].max() - df_ranked["Schools"] + 1
df_ranked["Activity"] = df_ranked["Activity"].max() - df_ranked["Activity"] + 1
df_ranked["Houses_per_resident"] = df_ranked["Houses_per_resident"].max() - df_ranked["Houses_per_resident"] + 1

weights = [1.2, 0.5, 1, 0.5, 0.8, 0.8]
df_ranked["Total_rank"] = df_ranked.dot(weights).rank()

In [183]:
df_ranked.sort_values(by="Total_rank")

Cluster Labels,Unemployment_Rate,Schools,Activity,Unfriendly,Population_Density,Houses_per_resident,Total_rank
7,2.0,9.0,5.0,3.0,2.0,2.0,1.0
3,1.0,10.0,8.5,5.5,1.0,1.0,2.0
8,5.0,6.0,2.0,2.0,5.0,5.0,3.0
6,4.0,8.0,3.0,9.0,4.0,3.0,4.0
0,3.0,7.0,7.0,8.0,3.0,4.0,5.0
9,10.0,3.0,1.0,1.0,8.0,8.0,6.0
1,7.0,1.0,4.0,5.5,9.0,7.0,7.0
4,6.0,4.5,8.5,7.0,6.0,6.0,8.0
2,9.0,4.5,6.0,4.0,7.0,10.0,9.0
5,8.0,2.0,10.0,10.0,10.0,9.0,10.0
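
To trace the scoring by hand, here is a reduced sketch with three of the clusters (7, 3 and 5) and three of the six columns, using the cluster means from the table above. Because ranks are computed within this subset, the per-column ranks differ from the full table; the weights 1.2, 0.5 and 1 are the ones used above for these columns.

```python
import pandas as pd

# Cluster means for clusters 7, 3 and 5, copied from the table above.
means = pd.DataFrame(
    {
        "Unemployment_Rate": [0.035959, 0.025531, 0.08345],
        "Schools": [0.882353, 0.625, 4.5],
        "Activity": [0.102941, 0.020833, 0.0],
    },
    index=pd.Index([7, 3, 5], name="Cluster Labels"),
)

# Rank 1 = smallest value, which is already "best" for unemployment.
ranked = means.rank()

# For schools and activities more is better, so flip these ranks.
for col in ["Schools", "Activity"]:
    ranked[col] = ranked[col].max() - ranked[col] + 1

# Named weights avoid depending on the column order of the frame.
weights = pd.Series({"Unemployment_Rate": 1.2, "Schools": 0.5, "Activity": 1.0})
ranked["Total_rank"] = ranked[weights.index].dot(weights).rank()
print(ranked.sort_values(by="Total_rank"))
```

The weighted sums are 4.4 for cluster 7, 4.7 for cluster 3 and 7.1 for cluster 5, so cluster 7 comes out first in this reduced example as well.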


## Discussion section

This final result depends heavily on the weights used. Unemployment gets the highest weight, while schools and kid-unfriendly venues get the lowest; the intent is to favor neighborhoods at some distance from the city center, which should give kids the space and safety they need.

In [184]:
map_df = df_ranked["Total_rank"].reset_index(drop=False).merge(wuppertal_merged[["Neighborhood", "Latitude", "Longitude", "Cluster Labels"]], on="Cluster Labels", how="left")
map_df
map_df["Total_rank"].value_counts()

1.0     17
2.0     16
5.0     10
8.0      8
4.0      6
3.0      5
10.0     2
9.0      2
7.0      2
6.0      1
Name: Total_rank, dtype: int64

In [185]:
# draw the choropleth map with explicit bins: ranks 1 and 2 each get their own color, ranks 3 to 10 share one
world_map = folium.Map(location=[51.256214, 7.150764], zoom_start=12)
world_map.choropleth(
    geo_data=geojson,
    data=map_df,
    columns=['Neighborhood', 'Total_rank'],
    key_on='feature.properties.NAME',
    fill_color="YlGnBu", 
    fill_opacity=0.6, 
    line_opacity=0.8,
    legend_name='Total rank',
    bins=[1,2,3,11],
    reset=True
)
world_map



## Conclusion section

We can conclude that cluster 7 is the best suited for young families, based on our criteria.
These seem to be the neighborhoods between the center and the outermost districts. This makes sense, as we'd expect the best balance there between kid-friendly activities and distance from nightlife, industrial zones and high-unemployment neighborhoods.