# Capstone Project - The Battle of the Neighborhoods (Week 2)

   Applied Data Science Capstone by IBM/Coursera

# Table of contents

- [Introduction: Business Problem](#introduction)
    
- [Data](#Data)
    
- [Methodology](#Methodology)
    
- [Analysis](#Analysis)
    
- [Results and Discussion](#results)
    
- [Conclusion](#Conclusion)
   

# Introduction: Business Problem
Cologne, the city the author lives in, attracts a large number of tourists, not least due to its famous cathedral, its trade fairs and conventions such as Gamescom, and its vibrant party scene. For tourists, however, finding the right place to eat can be a challenge. German dishes include a lot of meat, often pork, which many people do not want to eat for health-related, religious, cultural or moral reasons. This is just one reason to give tourists a good overview of what to eat where.

Thus, the goal of this exercise is to give tourists in Cologne a simple recommendation: in which district of the city will you find a large number, or even a concentration, of which types of restaurants? Where can you eat Mediterranean food, where can you find German food, where can you get fast food? The target audience is foreign tourists.

# Data 

Based on the definition of our problem, the factors that will influence our decision are:

- number of existing restaurants in the neighborhood (any type of restaurant)
- number of and distance to Italian restaurants in the neighborhood, if any
- distance of the neighborhood from the city center

We decided to use a regularly spaced grid of locations, centered around the city center, to define our neighborhoods.

I will, as requested by the assignment task, use Foursquare data about restaurants in Cologne. Foursquare is a US tech company from New York focusing on location data. Their technology and data power apps such as Apple's Maps, Uber, Twitter and many other household names. Here is an example of a restaurant in Cologne on Foursquare: https://de.foursquare.com/v/sattgr%C3%BCn/5c33306cc824ae002c2b414c. I will use Foursquare data such as the restaurant name, ID, location and category of food (vegetarian, Italian etc.).

Also, I will use the overview of districts/city parts of Cologne from Wikipedia: https://en.wikipedia.org/wiki/Districts_of_Cologne

We will use our data science powers to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.

# Methodology

In this section, I will describe the data analysis and how I used the data to yield the results.

Starting out, I scraped data from Wikipedia to create a data frame with the city districts of Cologne: https://en.wikipedia.org/wiki/Districts_of_Cologne. For this, I used the pandas read_html function. I had to clean the resulting data frame by removing unnecessary information and data that could not be handled in a data frame. The result is a nice data frame:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
url = 'https://en.wikipedia.org/wiki/Districts_of_Cologne'
wikipedia_page = requests.get(url)
df_raw = pd.read_html(wikipedia_page.content, header=0)[1]
#df_new = df_raw[df_raw.Map!='NaN']
df_raw.pop('Map')
df_raw.pop('Coat')
df_raw.pop('Town Hall')
df_raw = df_raw.replace({'District':''}, regex=True)
df_raw['City district'] = df_raw['City district'].str.replace(r'\d+', '', regex=True)
df_raw.head()


Unnamed: 0,City district,City parts,Area,Population1,Pop. density,District Councils
0,Köln-Innenstadt,"Altstadt-Nord, Altstadt-Süd, Deutz, Neustadt-N...",16.4 km²,127.033,7.746/km²,"Bezirksksamt Innenstadt Brückenstraße 19, D-50..."
1,Köln-Rodenkirchen,"Bayenthal, Godorf, Hahnwald, Immendorf, Marien...",54.6 km²,100.936,1.850/km²,"Bezirksamt Rodenkirchen Hauptstraße 85, D-5099..."
2,Köln-Lindenthal,"Braunsfeld, Junkersdorf, Klettenberg, Lindenth...",41.6 km²,137.552,3.308/km²,"Bezirksamt Lindenthal Aachener Straße 220, 509..."
3,Köln-Ehrenfeld,"Bickendorf, Bocklemünd/Mengenich, Ehrenfeld, N...",23.8 km²,103.621,4.348/km²,"Bezirksamt Ehrenfeld Venloer Straße 419 – 421,..."
4,Köln-Nippes,"Bilderstöckchen, Longerich, Mauenheim, Niehl, ...",31.8 km²,110.092,3.462/km²,"Bezirksamt NippesNeusser Straße 450,D-50733 Köln"
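
The digit cleanup in the cell above relies on `str.replace` treating its pattern as a regular expression, which recent pandas versions only do when `regex=True` is passed. The following minimal sketch shows the same cleaning steps on a toy frame. The sample values and the placeholder `Map` column are assumptions for illustration, not the scraped Wikipedia data.

```python
import pandas as pd

# Toy stand-in for the scraped table (values are made up for illustration).
toy = pd.DataFrame({
    "City district": ["District 1 Köln-Innenstadt", "District 2 Köln-Rodenkirchen"],
    "Map": [None, None],
})

# Drop a column that carries no usable information.
cleaned = toy.drop(columns=["Map"])

# Remove the literal word, then every run of digits, then stray whitespace.
cleaned["City district"] = (
    cleaned["City district"]
    .str.replace("District", "", regex=False)
    .str.replace(r"\d+", "", regex=True)
    .str.strip()
)

print(cleaned["City district"].tolist())
```

Passing `regex` explicitly keeps the behavior the same across pandas versions.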


Then, I installed the geopy package from conda-forge and used its Nominatim geocoder to add geospatial data to the data frame: the latitude and longitude shown in the two rightmost columns of the following table.

In [3]:
!pip install folium
import folium
from geopy.geocoders import Nominatim 
import requests



In [4]:
from geopy.exc import GeocoderTimedOut 
from geopy.geocoders import Nominatim 
import numpy as np 
# declare an empty list to store 
# latitude and longitude of values  
# of city column 
longitude = [] 
latitude = [] 
   
# function to find the coordinate 
# of a given city  
def findGeocode(City): 
       
    # try and catch is used to overcome 
    # the exception thrown by geolocator 
    # using geocodertimedout   
    try: 
          
        # Specify the user_agent as your 
        # app name it should not be none 
        geolocator = Nominatim(user_agent="your_app_name") 
          
        return geolocator.geocode(City) 
      
    except GeocoderTimedOut: 
          
        return findGeocode(City)     
  
# each value from the city district column
# is fetched and sent to the
# function findGeocode
for i in (df_raw["City district"]): 
    # geocode each district once and reuse the result
    loc = findGeocode(i) 

    if loc is not None: 
        # coordinates returned from the
        # function are stored in
        # two separate lists
        latitude.append(loc.latitude) 
        longitude.append(loc.longitude) 

    # if no coordinates are found for a
    # district, insert NaN to indicate
    # a missing value
    else: 
        latitude.append(np.nan) 
        longitude.append(np.nan) 

In [5]:
# now add this column to dataframe 
df_raw["Longitude"] = longitude 
df_raw["Latitude"] = latitude 
  
df_raw

Unnamed: 0,City district,City parts,Area,Population1,Pop. density,District Councils,Longitude,Latitude
0,Köln-Innenstadt,"Altstadt-Nord, Altstadt-Süd, Deutz, Neustadt-N...",16.4 km²,127.033,7.746/km²,"Bezirksksamt Innenstadt Brückenstraße 19, D-50...",6.959234,50.937328
1,Köln-Rodenkirchen,"Bayenthal, Godorf, Hahnwald, Immendorf, Marien...",54.6 km²,100.936,1.850/km²,"Bezirksamt Rodenkirchen Hauptstraße 85, D-5099...",6.969718,50.865622
2,Köln-Lindenthal,"Braunsfeld, Junkersdorf, Klettenberg, Lindenth...",41.6 km²,137.552,3.308/km²,"Bezirksamt Lindenthal Aachener Straße 220, 509...",6.871246,50.935935
3,Köln-Ehrenfeld,"Bickendorf, Bocklemünd/Mengenich, Ehrenfeld, N...",23.8 km²,103.621,4.348/km²,"Bezirksamt Ehrenfeld Venloer Straße 419 – 421,...",6.916529,50.951502
4,Köln-Nippes,"Bilderstöckchen, Longerich, Mauenheim, Niehl, ...",31.8 km²,110.092,3.462/km²,"Bezirksamt NippesNeusser Straße 450,D-50733 Köln",6.941777,50.958994
5,Köln-Chorweiler,"Blumenberg, Chorweiler, Esch/Auweiler, Fühling...",67.2 km²,80.870,1.204/km²,"Bezirksamt Chorweiler Pariser Platz 1, D-50765...",6.898034,51.021167
6,Köln-Porz,"Eil, Elsdorf, Ensen, Finkenberg, Gremberghoven...",78.8 km²,106.520,1.352/km²,"Bezirksamt PorzFriedrich-Ebert-Ufer 64–70, D-5...",6.999129,50.906705
7,Köln-Kalk,"Brück, Höhenberg, Humboldt/Gremberg, Kalk, Mer...",38.2 km²,108.330,2.841/km²,"Bezirksamt KalkKalker Hauptstraße 247–273,D-51...",7.005806,50.931923
8,Köln-Mülheim,"Buchforst, Buchheim, Dellbrück, Dünnwald, Flit...",52.2 km²,144.374,2.764/km²,"Bezirksamt Mülheim Wiener Platz 2a,D-51065 Köln",7.013526,50.958147
9,Cologne,,405.15 km2,1.019.3282,2.516/km2,2.516/km2,6.959974,50.938361


In [6]:
df_raw = df_raw[:-2]
df_raw

Unnamed: 0,City district,City parts,Area,Population1,Pop. density,District Councils,Longitude,Latitude
0,Köln-Innenstadt,"Altstadt-Nord, Altstadt-Süd, Deutz, Neustadt-N...",16.4 km²,127.033,7.746/km²,"Bezirksksamt Innenstadt Brückenstraße 19, D-50...",6.959234,50.937328
1,Köln-Rodenkirchen,"Bayenthal, Godorf, Hahnwald, Immendorf, Marien...",54.6 km²,100.936,1.850/km²,"Bezirksamt Rodenkirchen Hauptstraße 85, D-5099...",6.969718,50.865622
2,Köln-Lindenthal,"Braunsfeld, Junkersdorf, Klettenberg, Lindenth...",41.6 km²,137.552,3.308/km²,"Bezirksamt Lindenthal Aachener Straße 220, 509...",6.871246,50.935935
3,Köln-Ehrenfeld,"Bickendorf, Bocklemünd/Mengenich, Ehrenfeld, N...",23.8 km²,103.621,4.348/km²,"Bezirksamt Ehrenfeld Venloer Straße 419 – 421,...",6.916529,50.951502
4,Köln-Nippes,"Bilderstöckchen, Longerich, Mauenheim, Niehl, ...",31.8 km²,110.092,3.462/km²,"Bezirksamt NippesNeusser Straße 450,D-50733 Köln",6.941777,50.958994
5,Köln-Chorweiler,"Blumenberg, Chorweiler, Esch/Auweiler, Fühling...",67.2 km²,80.87,1.204/km²,"Bezirksamt Chorweiler Pariser Platz 1, D-50765...",6.898034,51.021167
6,Köln-Porz,"Eil, Elsdorf, Ensen, Finkenberg, Gremberghoven...",78.8 km²,106.52,1.352/km²,"Bezirksamt PorzFriedrich-Ebert-Ufer 64–70, D-5...",6.999129,50.906705
7,Köln-Kalk,"Brück, Höhenberg, Humboldt/Gremberg, Kalk, Mer...",38.2 km²,108.33,2.841/km²,"Bezirksamt KalkKalker Hauptstraße 247–273,D-51...",7.005806,50.931923
8,Köln-Mülheim,"Buchforst, Buchheim, Dellbrück, Dünnwald, Flit...",52.2 km²,144.374,2.764/km²,"Bezirksamt Mülheim Wiener Platz 2a,D-51065 Köln",7.013526,50.958147
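
The geocoding helper above retries on a timeout by calling itself, with no upper limit on the number of attempts. As a hedged alternative, the sketch below shows a bounded retry loop. `TransientError`, `with_retries` and `flaky_geocode` are hypothetical names used only for this illustration; in the notebook the exception would be geopy's `GeocoderTimedOut` and the function would wrap `geolocator.geocode`.

```python
import time


class TransientError(Exception):
    """Stand-in for a timeout such as geopy's GeocoderTimedOut."""


def with_retries(func, arg, max_attempts=3, delay_seconds=0.0):
    """Call func(arg), retrying on TransientError at most max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(arg)
        except TransientError:
            if attempt == max_attempts:
                return None  # give up; the caller can store NaN instead
            time.sleep(delay_seconds)


# Hypothetical geocoder that times out once and then succeeds.
calls = {"n": 0}


def flaky_geocode(name):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientError
    return (50.937328, 6.959234)


def always_failing(name):
    raise TransientError


result = with_retries(flaky_geocode, "Köln-Innenstadt")
failed = with_retries(always_failing, "Köln-Innenstadt")
print(result, failed)
```

Returning `None` after the last attempt matches the notebook's existing handling, which appends NaN when no location is found.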


Using the folium package and my data frame, I then created a map with the nine city districts on it.

In [7]:
!pip install folium



In [8]:
address = 'Cologne'

geolocator = Nominatim(user_agent="cologne_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Cologne are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Cologne are 50.938361, 6.959974.


In [9]:
import folium

# create map of Cologne using latitude and longitude values
map_Cologne = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, Citydistrict, Cityparts in zip(df_raw['Latitude'], df_raw['Longitude'], df_raw['City district'], df_raw['City parts']):
    label = '{}, {}'.format(Citydistrict, Cityparts)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        ).add_to(map_Cologne)  
    
map_Cologne

In [10]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City district', 
                  'City district Latitude', 
                  'City district Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
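
The function above builds the request URL by string formatting and indexes directly into the JSON response. The sketch below shows the same two steps in a form that can be tried offline: the query string is built with `urllib.parse.urlencode`, and the parsing tolerates missing keys. `build_explore_url`, `extract_venues` and the sample payload are assumptions for illustration; the payload only mirrors the keys that the notebook code reads and is not real Foursquare output.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the explore URL; urlencode escapes each parameter."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; empty list if keys are missing."""
    groups = payload.get("response", {}).get("groups") or [{}]
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name"),
        ))
    return rows


# Hand-written sample shaped like the fields accessed in getNearbyVenues.
sample = {"response": {"groups": [{"items": [{"venue": {
    "name": "Craftbeer Corner",
    "location": {"lat": 50.937222, "lng": 6.958928},
    "categories": [{"name": "Beer Bar"}],
}}]}]}}

url = build_explore_url("ID", "SECRET", "20180604", 50.937328, 6.959234)
venues = extract_venues(sample)
print(url)
print(venues)
```

A venue without any category yields `None` in the last position instead of raising an `IndexError`.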

In [12]:
cologne_venues = getNearbyVenues(names=df_raw['City district'],
                                   latitudes=df_raw['Latitude'],
                                   longitudes=df_raw['Longitude'])

  Köln-Innenstadt
  Köln-Rodenkirchen
  Köln-Lindenthal
  Köln-Ehrenfeld
  Köln-Nippes
  Köln-Chorweiler
  Köln-Porz
  Köln-Kalk
  Köln-Mülheim


In [13]:
print(cologne_venues.shape)
cologne_venues.head()

(229, 7)


Unnamed: 0,City district,City district Latitude,City district Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Köln-Innenstadt,50.937328,6.959234,Craftbeer Corner,50.937222,6.958928,Beer Bar
1,Köln-Innenstadt,50.937328,6.959234,Papa Joe's Jazzlokal,50.937882,6.962241,Jazz Club
2,Köln-Innenstadt,50.937328,6.959234,LEGO Store,50.937042,6.956564,Toy / Game Store
3,Köln-Innenstadt,50.937328,6.959234,Alter Markt,50.938623,6.96007,Plaza
4,Köln-Innenstadt,50.937328,6.959234,Heumarkt,50.936161,6.960461,Plaza
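
The venues were requested within the default `radius=500` (meters) of each district's coordinates. As a plausibility check, the sketch below computes the great-circle distance between the Köln-Innenstadt coordinates and the first venue in the table using the haversine formula. `haversine_m` is a helper written for this illustration, and the Earth radius of 6,371 km is the usual mean-radius approximation.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# District coordinates and venue coordinates taken from the first table row.
distance = haversine_m(50.937328, 6.959234, 50.937222, 6.958928)
print(f"{distance:.1f} m from the district coordinates")
print("inside 500 m radius:", distance <= 500)
```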


# Analysis 
Let's perform some basic exploratory data analysis and derive some additional information from our raw data. First, let's count the number of venues returned for every city district:

In [14]:
cologne_venues.groupby('City district').count()

Unnamed: 0_level_0,City district Latitude,City district Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
City district,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Köln-Chorweiler,14,14,14,14,14,14
Köln-Ehrenfeld,62,62,62,62,62,62
Köln-Innenstadt,89,89,89,89,89,89
Köln-Kalk,4,4,4,4,4,4
Köln-Lindenthal,18,18,18,18,18,18
Köln-Mülheim,23,23,23,23,23,23
Köln-Nippes,8,8,8,8,8,8
Köln-Porz,5,5,5,5,5,5
Köln-Rodenkirchen,6,6,6,6,6,6


In [15]:
# find out how many unique categories can be curated from all the returned venues

print('There are {} unique categories.'.format(len(cologne_venues['Venue Category'].unique())))

There are 111 unique categories.


In [16]:
# one hot encoding
cologne_onehot = pd.get_dummies(cologne_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
cologne_onehot['City district'] = cologne_venues['City district'] 

# move neighborhood column to the first column
fixed_columns = [cologne_onehot.columns[-1]] + list(cologne_onehot.columns[:-1])
cologne_onehot = cologne_onehot[fixed_columns]

cologne_onehot.head()

Unnamed: 0,City district,Art Gallery,Art Museum,Athletics & Sports,Auto Garage,BBQ Joint,Bakery,Bank,Bar,Baseball Stadium,...,Tapas Restaurant,Tennis Court,Thai Restaurant,Theater,Toy / Game Store,Tram Station,Trattoria/Osteria,Turkish Restaurant,Vegetarian / Vegan Restaurant,Wine Bar
0,Köln-Innenstadt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Köln-Innenstadt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Köln-Innenstadt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,Köln-Innenstadt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Köln-Innenstadt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:

cologne_grouped = cologne_onehot.groupby('City district').mean().reset_index()
cologne_grouped

Unnamed: 0,City district,Art Gallery,Art Museum,Athletics & Sports,Auto Garage,BBQ Joint,Bakery,Bank,Bar,Baseball Stadium,...,Tapas Restaurant,Tennis Court,Thai Restaurant,Theater,Toy / Game Store,Tram Station,Trattoria/Osteria,Turkish Restaurant,Vegetarian / Vegan Restaurant,Wine Bar
0,Köln-Chorweiler,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0
1,Köln-Ehrenfeld,0.016129,0.0,0.0,0.0,0.0,0.0,0.016129,0.096774,0.0,...,0.032258,0.0,0.016129,0.016129,0.0,0.0,0.0,0.016129,0.0,0.0
2,Köln-Innenstadt,0.0,0.044944,0.0,0.0,0.0,0.011236,0.0,0.0,0.0,...,0.011236,0.0,0.0,0.022472,0.022472,0.0,0.0,0.0,0.011236,0.011236
3,Köln-Kalk,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
4,Köln-Lindenthal,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.055556,...,0.0,0.0,0.0,0.0,0.0,0.111111,0.055556,0.0,0.0,0.0
5,Köln-Mülheim,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0
6,Köln-Nippes,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,...,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Köln-Porz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Köln-Rodenkirchen,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
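
Taking the mean of one-hot columns per district yields, for each category, its share among that district's venues. The following sketch reproduces that calculation with the standard library only. The venue rows are made up for illustration; they are merely chosen so that Köln-Kalk has four venues and therefore a share of 0.25 per category, as in the table above.

```python
from collections import Counter, defaultdict


def category_shares(venue_rows):
    """Map each district to {category: share of that district's venues}."""
    counts = defaultdict(Counter)
    for district, category in venue_rows:
        counts[district][category] += 1
    return {
        district: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for district, counter in counts.items()
    }


# Invented sample rows: (district, venue category).
rows = [
    ("Köln-Kalk", "Greek Restaurant"),
    ("Köln-Kalk", "Theater"),
    ("Köln-Kalk", "Park"),
    ("Köln-Kalk", "Supermarket"),
    ("Köln-Nippes", "Bakery"),
    ("Köln-Nippes", "Bakery"),
]

shares = category_shares(rows)
print(shares["Köln-Kalk"])
```

A district with very few venues produces large shares from a single venue, which is worth keeping in mind when comparing districts.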


I used this information to create a data frame in which you can see the most common restaurant venue types for each city district.

In [18]:
filtered_columns = ['City district'] + [col for col in cologne_grouped.columns if col.endswith('Restaurant')]
dataframe_filtered = cologne_grouped.loc[:, filtered_columns].head()
dataframe_filtered

Unnamed: 0,City district,Bavarian Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,...,Portuguese Restaurant,Restaurant,Scandinavian Restaurant,Seafood Restaurant,South American Restaurant,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant
0,Köln-Chorweiler,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0
1,Köln-Ehrenfeld,0.0,0.0,0.0,0.0,0.016129,0.016129,0.0,0.016129,0.0,...,0.016129,0.032258,0.0,0.0,0.0,0.0,0.032258,0.016129,0.016129,0.0
2,Köln-Innenstadt,0.011236,0.011236,0.0,0.011236,0.0,0.0,0.011236,0.011236,0.011236,...,0.0,0.011236,0.0,0.011236,0.011236,0.0,0.011236,0.0,0.0,0.011236
3,Köln-Kalk,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Köln-Lindenthal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's define a function that retrieves the ten most common restaurant categories for a city district:

In [19]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [20]:
num_top_restaurant = 10
import numpy as np
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City district']
for ind in np.arange(num_top_restaurant):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['City district'] = dataframe_filtered['City district']

for ind in np.arange(dataframe_filtered.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dataframe_filtered.iloc[ind, :], num_top_restaurant)

neighborhoods_venues_sorted.head()

Unnamed: 0,City district,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Köln-Chorweiler,Sushi Restaurant,Fast Food Restaurant,Vegetarian / Vegan Restaurant,Kebab Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,German Restaurant
1,Köln-Ehrenfeld,Italian Restaurant,Restaurant,Tapas Restaurant,Kebab Restaurant,Portuguese Restaurant,Turkish Restaurant,Falafel Restaurant,Modern European Restaurant,Lebanese Restaurant,German Restaurant
2,Köln-Innenstadt,Italian Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Chinese Restaurant,Eastern European Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Mediterranean Restaurant
3,Köln-Kalk,Greek Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant
4,Köln-Lindenthal,Italian Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant
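
Sorting a row by frequency always returns ten names, so categories with a share of 0.0 fill the remaining slots when a district has only a few restaurant categories. A possible variant, sketched below under that assumption, keeps only categories that actually occur. `top_categories` and the sample row are hypothetical and not part of the notebook.

```python
def top_categories(shares, n=10):
    """Return up to n categories with a positive share, most frequent first."""
    # Sort by descending share; break ties alphabetically for stable output.
    ranked = sorted(shares.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, share in ranked if share > 0][:n]


# Invented row with only two categories present.
row = {
    "Greek Restaurant": 0.25,
    "German Restaurant": 0.0,
    "Chinese Restaurant": 0.0,
    "Italian Restaurant": 0.125,
}

print(top_categories(row))
```

With this variant a short list signals directly that a district has few distinct restaurant categories in the data.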


The table of most common venues lists the ten most frequent restaurant categories for each of the five city districts in the filtered data frame.

Next, I cluster these districts with k-means, using five clusters, based on their restaurant category frequencies:

In [21]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

cologne_grouped_clustering = dataframe_filtered.drop(columns='City district')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cologne_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 3, 4, 1, 0], dtype=int32)
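
Because `dataframe_filtered` holds five rows and `kclusters` is 5, each district ends up in a cluster of its own, as the label array shows. The sketch below is a minimal plain-Python version of the k-means idea (assign each point to its nearest centroid, then recompute the centroids) to illustrate this. It is not scikit-learn's implementation: the initialization simply takes the first `k` points, and the sample points are invented.

```python
def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated groups, clustered with k=2.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels, centroids = kmeans(points, k=2)

# As many clusters as points: every point gets its own label.
labels_all, _ = kmeans(points[:3], k=3)
print(labels, labels_all)
```

Choosing fewer clusters than districts is what allows similar districts to share a label.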

In [22]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

cologne_merged = df_raw

# join the sorted venues with the district data to add latitude/longitude for each city district
cologne_merged = cologne_merged.join(neighborhoods_venues_sorted.set_index('City district'), on='City district')

cologne_merged.head()

Unnamed: 0,City district,City parts,Area,Population1,Pop. density,District Councils,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Köln-Innenstadt,"Altstadt-Nord, Altstadt-Süd, Deutz, Neustadt-N...",16.4 km²,127.033,7.746/km²,"Bezirksksamt Innenstadt Brückenstraße 19, D-50...",6.959234,50.937328,4.0,Italian Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Chinese Restaurant,Eastern European Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Mediterranean Restaurant
1,Köln-Rodenkirchen,"Bayenthal, Godorf, Hahnwald, Immendorf, Marien...",54.6 km²,100.936,1.850/km²,"Bezirksamt Rodenkirchen Hauptstraße 85, D-5099...",6.969718,50.865622,,,,,,,,,,,
2,Köln-Lindenthal,"Braunsfeld, Junkersdorf, Klettenberg, Lindenth...",41.6 km²,137.552,3.308/km²,"Bezirksamt Lindenthal Aachener Straße 220, 509...",6.871246,50.935935,0.0,Italian Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant
3,Köln-Ehrenfeld,"Bickendorf, Bocklemünd/Mengenich, Ehrenfeld, N...",23.8 km²,103.621,4.348/km²,"Bezirksamt Ehrenfeld Venloer Straße 419 – 421,...",6.916529,50.951502,3.0,Italian Restaurant,Restaurant,Tapas Restaurant,Kebab Restaurant,Portuguese Restaurant,Turkish Restaurant,Falafel Restaurant,Modern European Restaurant,Lebanese Restaurant,German Restaurant
4,Köln-Nippes,"Bilderstöckchen, Longerich, Mauenheim, Niehl, ...",31.8 km²,110.092,3.462/km²,"Bezirksamt NippesNeusser Straße 450,D-50733 Köln",6.941777,50.958994,,,,,,,,,,,


In [23]:
cologne_merged.dropna()

Unnamed: 0,City district,City parts,Area,Population1,Pop. density,District Councils,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Köln-Innenstadt,"Altstadt-Nord, Altstadt-Süd, Deutz, Neustadt-N...",16.4 km²,127.033,7.746/km²,"Bezirksksamt Innenstadt Brückenstraße 19, D-50...",6.959234,50.937328,4.0,Italian Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Chinese Restaurant,Eastern European Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Mediterranean Restaurant
2,Köln-Lindenthal,"Braunsfeld, Junkersdorf, Klettenberg, Lindenth...",41.6 km²,137.552,3.308/km²,"Bezirksamt Lindenthal Aachener Straße 220, 509...",6.871246,50.935935,0.0,Italian Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant
3,Köln-Ehrenfeld,"Bickendorf, Bocklemünd/Mengenich, Ehrenfeld, N...",23.8 km²,103.621,4.348/km²,"Bezirksamt Ehrenfeld Venloer Straße 419 – 421,...",6.916529,50.951502,3.0,Italian Restaurant,Restaurant,Tapas Restaurant,Kebab Restaurant,Portuguese Restaurant,Turkish Restaurant,Falafel Restaurant,Modern European Restaurant,Lebanese Restaurant,German Restaurant
5,Köln-Chorweiler,"Blumenberg, Chorweiler, Esch/Auweiler, Fühling...",67.2 km²,80.87,1.204/km²,"Bezirksamt Chorweiler Pariser Platz 1, D-50765...",6.898034,51.021167,2.0,Sushi Restaurant,Fast Food Restaurant,Vegetarian / Vegan Restaurant,Kebab Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,German Restaurant
7,Köln-Kalk,"Brück, Höhenberg, Humboldt/Gremberg, Kalk, Mer...",38.2 km²,108.33,2.841/km²,"Bezirksamt KalkKalker Hauptstraße 247–273,D-51...",7.005806,50.931923,1.0,Greek Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant
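
The NaN cluster labels in the merged frame come from the left join: districts that are missing from `neighborhoods_venues_sorted` have no matching row. Note also that `dropna()` returns a new data frame, so `cologne_merged` itself still contains all nine districts unless the result is assigned back. The plain-Python sketch below illustrates both points with a dictionary lookup; the three districts and their labels are taken from the tables above, and the variable names are chosen for this illustration.

```python
# Left side of the join: all districts we want to keep.
districts = ["Köln-Innenstadt", "Köln-Rodenkirchen", "Köln-Lindenthal"]

# Right side: cluster labels exist only for the clustered districts.
cluster_by_district = {"Köln-Innenstadt": 4, "Köln-Lindenthal": 0}

# Left join: every district is kept, missing labels become None.
merged = [(d, cluster_by_district.get(d)) for d in districts]

# Filtering builds a new list; `merged` itself is unchanged.
clustered_only = [(d, label) for d, label in merged if label is not None]

print(merged)
print(clustered_only)
```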


In [24]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,City district,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,2,Köln-Chorweiler,Sushi Restaurant,Fast Food Restaurant,Vegetarian / Vegan Restaurant,Kebab Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,German Restaurant
1,3,Köln-Ehrenfeld,Italian Restaurant,Restaurant,Tapas Restaurant,Kebab Restaurant,Portuguese Restaurant,Turkish Restaurant,Falafel Restaurant,Modern European Restaurant,Lebanese Restaurant,German Restaurant
2,4,Köln-Innenstadt,Italian Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Chinese Restaurant,Eastern European Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Mediterranean Restaurant
3,1,Köln-Kalk,Greek Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant
4,0,Köln-Lindenthal,Italian Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant


What we see in the table are the five clustered city districts and their most common restaurant categories, each with a cluster label from 0 to 4.

We can now use the cluster labels to show the city districts marked with a cluster-specific color on a map, again using folium:

In [25]:
address = 'Köln Innenstadt'

geolocator = Nominatim(user_agent="cologne_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Köln Innenstadt are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Köln Innenstadt are 50.93732845, 6.959234323073302.


In [27]:
# create map
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map; districts without a cluster label (NaN) are drawn in gray
markers_colors = []
for lat, lon, Citydistrict, ClusterLabels in zip(cologne_merged['Latitude'], cologne_merged['Longitude'], cologne_merged['City district'], cologne_merged['Cluster Labels']):
    label = folium.Popup(str(Citydistrict) + ' Cluster ' + str(ClusterLabels), parse_html=True)
    # pick the cluster color, falling back to gray for unclustered districts
    marker_color = 'gray' if pd.isna(ClusterLabels) else rainbow[int(ClusterLabels)]
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=marker_color,
        fill=True,
        fill_color=marker_color,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The map shows nine markers, one for each city district. Each popup gives the district name and its cluster label; the four districts that were not part of the clustering have no label (NaN).

Now, what is the final result of this exercise? We can now show clusters of restaurant type concentrations for the city of Cologne, named according to the most common restaurant type the data shows.

Cluster 0 - Italian Restaurant

In [32]:
cologne_merged.loc[cologne_merged['Cluster Labels'] == 0, cologne_merged.columns[[1] + list(range(5, cologne_merged.shape[1]))]]

Unnamed: 0,City parts,District Councils,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Braunsfeld, Junkersdorf, Klettenberg, Lindenth...","Bezirksamt Lindenthal Aachener Straße 220, 509...",6.871246,50.935935,0.0,Italian Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant


Cluster 1 - Greek Restaurant

In [30]:
cologne_merged.loc[cologne_merged['Cluster Labels'] == 1, cologne_merged.columns[[1] + list(range(5, cologne_merged.shape[1]))]]

Unnamed: 0,City parts,District Councils,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,"Brück, Höhenberg, Humboldt/Gremberg, Kalk, Mer...","Bezirksamt KalkKalker Hauptstraße 247–273,D-51...",7.005806,50.931923,1.0,Greek Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant


Cluster 2 - Sushi Restaurant

In [33]:
cologne_merged.loc[cologne_merged['Cluster Labels'] == 2, cologne_merged.columns[[1] + list(range(5, cologne_merged.shape[1]))]]

Unnamed: 0,City parts,District Councils,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,"Blumenberg, Chorweiler, Esch/Auweiler, Fühling...","Bezirksamt Chorweiler Pariser Platz 1, D-50765...",6.898034,51.021167,2.0,Sushi Restaurant,Fast Food Restaurant,Vegetarian / Vegan Restaurant,Kebab Restaurant,Chinese Restaurant,Doner Restaurant,Eastern European Restaurant,Ethiopian Restaurant,Falafel Restaurant,German Restaurant


Cluster 3 - Italian Restaurant

In [34]:
cologne_merged.loc[cologne_merged['Cluster Labels'] == 3, cologne_merged.columns[[1] + list(range(5, cologne_merged.shape[1]))]]

Unnamed: 0,City parts,District Councils,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,"Bickendorf, Bocklemünd/Mengenich, Ehrenfeld, N...","Bezirksamt Ehrenfeld Venloer Straße 419 – 421,...",6.916529,50.951502,3.0,Italian Restaurant,Restaurant,Tapas Restaurant,Kebab Restaurant,Portuguese Restaurant,Turkish Restaurant,Falafel Restaurant,Modern European Restaurant,Lebanese Restaurant,German Restaurant

Cluster 4 - Italian Restaurant


Unnamed: 0,City parts,District Councils,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Altstadt-Nord, Altstadt-Süd, Deutz, Neustadt-N...","Bezirksksamt Innenstadt Brückenstraße 19, D-50...",6.959234,50.937328,4.0,Italian Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Chinese Restaurant,Eastern European Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Mediterranean Restaurant


# Results and Discussion

When I reflect on the work necessary to create these results, what comes to mind is that for the typical ways of scraping, cleaning, handling, transforming and visualizing data, all the tools are simply there. We just have to get to know the available open source packages and learn how to use them.

Those location candidates were then clustered to create zones of interest which contain the greatest number of location candidates. Addresses of the centers of those zones were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors.

What I find fantastic is that nearly all of these tools are free of charge. Also, a simple notebook computer is enough. All the rest is concentrated, creative, interesting, sometimes hard work and searching for hints, tips, examples, explanations etc. on the web. With these tools, many exciting data science use cases can be created, for all kinds of useful purposes.

# Conclusion

We achieved the goal presented at the outset of this blog post: tourists can see in the results which city districts best match their food desires. This is just one example of the fantastic data science use cases one can realize by applying technology which is available for free today! What a time to be alive.

The purpose of this project was to identify Cologne areas close to the center with a low number of restaurants in order to aid stakeholders in narrowing down the search for an optimal location for a new restaurant. By calculating the restaurant density distribution from Foursquare data, we first identified the area for closer analysis (Köln Innenstadt), and then generated an extensive collection of locations which satisfy some basic requirements regarding existing nearby restaurants. Clustering of those locations was then performed in order to create major zones of interest, and addresses of those zone centers were created to be used as starting points for final exploration by stakeholders.

The final decision on the optimal restaurant location will be made by stakeholders based on specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors like attractiveness of each location, levels of noise / proximity to major roads, real estate availability, prices, social and economic dynamics of every neighborhood etc.