# Coursera Capstone - "Battle of the Neighborhoods"

### Catherine Gronemann


# REPORT

## Content

### 1. Introduction Section:

Description of the problem & discussion of the background (who would be interested in this project)

### 2. Data Section:

Description of the data that will be used to solve the problem and the source of the data

### 3. Methodology section:
(main component of the report)  

- 3.1 Getting the Data
- 3.2 Visualizing the data
- 3.3 Utilizing the Foursquare API to explore the neighborhoods' venues and segment them
- 3.4 Inhabitants/bakery calculation
- 3.5 Clustering and evaluating the venue data

### 4. Results section

Discussion of results. 

### 5. Discussion section

Discussion of any observations you noted and any recommendations you can make based on the results.

### 6. Conclusion section



# 1. Introduction Section:

My client is a large bakery chain that wants to enter the market in Munich, the capital of Bavaria in southern Germany. As a bakery, they do not consider the city center the best place for their stores. Their market research found that Germans like to get fresh “Brötchen” (bread rolls) from the bakery in the morning and eat them for breakfast at home, so being close to people’s homes matters more. 
Competition also plays a role: the client wants to build their stores in the neighborhoods with the fewest competing bakeries. 
Finally, they want to take all other venue categories into account. Other venues can be favorable, because they give people an additional reason to come to the area and then conveniently buy at the bakery as well. They can also be a threat, for example if a customer decides to go to a café or restaurant instead. The client therefore wishes to see a segmentation of each neighborhood's overall venues and to include this information in the final investment decision. 

So the bakery chain wants me, as a data scientist, to find neighborhoods in Munich, Germany, that have many inhabitants but few bakeries per inhabitant, and to present a segmentation of the neighborhoods' venues in general as additional decision support. All information needs to be visualized in a management-ready format. 

# 2. Data Section:

Data needed to examine the best neighborhood to build a bakery store in Munich:

- **inhabitants per neighborhood of Munich** (source: html table at https://suedbayerische-immobilien.de/Einwohner-Muenchen-Stadtteile)

- **location (longitude & latitude) of Munich** (source: geopy library)

- **location (longitude & latitude) of the centroids of Munich’s neighborhoods** (source: https://www.gps-latitude-longitude.com/address-to-longitude-latitude-gps-coordinates & https://www.google.com/maps/)

- **number and location of bakeries in Munich, assigned to the neighborhoods** (source: Foursquare)

- **number and location of all other venues in Munich, assigned to the neighborhoods** (source: Foursquare)  

- **map data to visualize the venues and neighborhoods for management** (source: Folium, a visualization library for interactive maps. It is possible to zoom into the maps and click on each circle marker to reveal the name of the respective neighborhood.)


# 3. Methodology Section:

## 3.1 Getting the Data 

*Install the following packages in a terminal:*  
**pip install beautifulsoup4** --> Beautiful Soup is a Python library for pulling data out of HTML and XML files  
**pip install lxml** --> an HTML parser  
**pip install requests** --> HTTP request library  
**pip install geocoder** --> to get geodata

### Scraping the inhabitant data of Munich from a html table

In [155]:
#Importing libraries:
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
import pandas as pd

url = 'https://suedbayerische-immobilien.de/Einwohner-Muenchen-Stadtteile'

#page, to handle the contents of the website
page = requests.get(url)
#parse website and store contents under doc
doc = lh.fromstring(page.content)
#parse data that is stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')


#Parsing the first row as header:

#Create an empty list of (column name, values) pairs
col=[]

#For each cell of the first row, store its text (the header) together with an empty list for the column values
for t in tr_elements[0]:
    name=t.text_content()
    col.append((name,[]))

col


# Parsing the data from row 2+ (since first row is the header):

for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If the row does not have exactly 3 cells, the //tr data is not from our table, so stop (adjust the number of columns here if needed)
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Leave the first column (the neighborhood name) as text
        if i>0:
            #Try to convert numerical values to integers; values that int() cannot parse are kept as text
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1


# Create the DataFrame
Dict = {title:column for (title,column) in col}
df_munich = pd.DataFrame(Dict)


# Transform the DataFrame:

# drop the last column since it's not needed  
df_munich = df_munich.drop(labels='Einwohner in % der Gesamtbevölkerung Münchens', axis=1)

#Rename columns to English
df_munich.columns = ['Neighborhood', 'Inhabitants']

# keep the first 25 rows (the neighborhoods) and drop the "Munich overall" total row
df_munich = df_munich.iloc[:25] 

df_munich

|  | Neighborhood | Inhabitants |
|---|---|---|
| 0 | 1 Ramersdorf - Perlach | 108.244 |
| 1 | 2 Neuhausen - Nymphenburg | 95.906 |
| 2 | 3 Thalkirchen - Obersendling - Forstenried - F... | 90.79 |
| 3 | 4 Bogenhausen | 82.138 |
| 4 | 5 Milbertshofen - Am Hart | 73.617 |
| 5 | 6 Pasing - Obermenzing | 70.783 |
| 6 | 7 Schwabing - Freimann | 69.676 |
| 7 | 8 Trudering - Riem | 67.009 |
| 8 | 9 Schwabing West | 65.892 |
| 9 | 10 Au - Haidhausen | 59.752 |
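
The `Inhabitants` values above use a dot as thousands separator, so `108.244` stands for 108,244 people. Python's `int('108.244')` raises a `ValueError`, which is why such values cannot be converted directly. A minimal sketch of a parser for such cells, assuming every dot is a thousands separator and there is no decimal part (the helper name `parse_german_int` is hypothetical and not part of the original notebook):

```python
def parse_german_int(text):
    """Convert a German-formatted integer such as '108.244' to 108244.

    Assumption: dots are thousands separators only (no decimal part).
    Returns None when the cleaned text is not a plain integer.
    """
    cleaned = text.strip().replace('.', '')
    if not cleaned.isdecimal():
        return None
    return int(cleaned)


print(parse_german_int('108.244'))      # 108244
print(parse_german_int('Bogenhausen'))  # None
```

The rest of the report keeps the values as scraped, so the `Inhabitants` column is effectively expressed in thousands.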


### Retrieving Geodata 

In [156]:
'''#Disabled attempt with the geocoder library (kept for reference):
import geocoder 

# initialize variable to None
lat_lng_coords = None

# loop until the coordinates are fetched
#for index, row in df.iterrows():
while(lat_lng_coords is None):
    g = geocoder.google('Sendling, Munich')
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]'''

In [157]:
# Geocoder did not return results, so https://www.gps-latitude-longitude.com/address-to-longitude-latitude-gps-coordinates & https://www.google.com/maps/ were used to create an Excel file with the geodata of the neighborhoods.

# Loading the Excel file with the geodata:
df_lat_lng = pd.read_excel(r'LatiLong_Munich.xlsx')
print (df_lat_lng)


                                         Neighborhood   Latitude  Longitude
0                              1 Ramersdorf - Perlach  48.103607  11.633565
1                           2 Neuhausen - Nymphenburg  48.155115  11.523016
2   3 Thalkirchen - Obersendling - Forstenried - F...  48.086792  11.513272
3                                       4 Bogenhausen  48.157355  11.649248
4                           5 Milbertshofen - Am Hart  48.210554  11.572193
5                              6 Pasing - Obermenzing  48.146631  11.459348
6                              7 Schwabing - Freimann  48.201196  11.614568
7                                  8 Trudering - Riem  48.128667  11.683546
8                                    9 Schwabing West  48.167852  11.571096
9                                  10 Au - Haidhausen  48.128592  11.593926
10                        11 Feldmoching - Hasenbergl  48.211504  11.513181
11                             12 Sendling - Westpark  48.115190  11.519808
12          
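
The disabled geocoder loop earlier in this section retries without any limit, so it never terminates if the service keeps returning nothing. A minimal sketch of a bounded retry, assuming the lookup is wrapped in a function that returns `None` on failure (the helper `retry` and the stand-in lookup with made-up coordinates are hypothetical, not part of the original notebook):

```python
import time


def retry(fetch, attempts=5, delay_seconds=1.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before the next try
    return None


# Stand-in for a geocoding call that fails twice before succeeding.
responses = iter([None, None, (48.1, 11.5)])
coords = retry(lambda: next(responses), attempts=5, delay_seconds=0)
print(coords)
```

When all attempts fail, the helper returns `None`, and the caller can fall back to manually collected coordinates as was done here.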

In [158]:
#Merging the two dataframes on Neighborhood Name
left = df_munich
right = df_lat_lng

result_df = pd.merge(left, right, on='Neighborhood')
result_df.head()

|  | Neighborhood | Inhabitants | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 1 Ramersdorf - Perlach | 108.244 | 48.103607 | 11.633565 |
| 1 | 2 Neuhausen - Nymphenburg | 95.906 | 48.155115 | 11.523016 |
| 2 | 3 Thalkirchen - Obersendling - Forstenried - F... | 90.79 | 48.086792 | 11.513272 |
| 3 | 4 Bogenhausen | 82.138 | 48.157355 | 11.649248 |
| 4 | 5 Milbertshofen - Am Hart | 73.617 | 48.210554 | 11.572193 |
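
`pd.merge` performs an inner join by default, so a neighborhood whose name is spelled differently in the two tables would silently drop out of the result. A minimal sketch of a pre-merge check with plain sets (the helper `unmatched_keys` is hypothetical, and the second list contains a deliberately altered spelling for illustration):

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys found on only one side of a join, as two sorted lists."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)


# Illustrative names only; the last entry of geodata is a spelling variant.
scraped = ['1 Ramersdorf - Perlach', '4 Bogenhausen', '9 Schwabing West']
geodata = ['1 Ramersdorf - Perlach', '4 Bogenhausen', '9 Schwabing-West']

only_scraped, only_geodata = unmatched_keys(scraped, geodata)
print(only_scraped)  # ['9 Schwabing West']
print(only_geodata)  # ['9 Schwabing-West']
```

Two empty lists mean every neighborhood has a partner in the other table and the merge keeps all rows.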


## 3.2 Visualizing the data

In [159]:
#importing necessary libraries
import numpy as np 

#import geocoder
import geocoder 

# convert an address into latitude and longitude values
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# map rendering library
!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium 

print('Libraries imported.')

Libraries imported.


### Using geopy library to get the latitude and longitude values of Munich 

In [160]:
address = 'Munich, Germany'

geolocator = Nominatim(user_agent="munich_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Munich are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Munich are 48.1371079, 11.5753822.


### Creating a map of Munich with neighborhoods displayed on it

In [161]:
# create map of Munich using latitude and longitude values
map_munich = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(result_df['Latitude'], result_df['Longitude'], result_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_munich)  
    
map_munich

## 3.3 Utilizing the Foursquare API to explore the Neighborhoods Venues and segment them

In [162]:
#Define Foursquare credentials and version
CLIENT_ID = '*****' # your Foursquare ID
CLIENT_SECRET = '*****' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#library to handle JSON files
import json 

#transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize 



#Explore the neighborhoods' venues and enrich the dataframe with them:


#function to retrieve nearby venues

#maximum number of venues requested from the Foursquare API per call
LIMIT = 200 
#define radius (in meters) for the search
radius = 800 

def getNearbyVenues(names, latitudes, longitudes, radius):
    
    #create empty list
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)



# running the above function on each neighborhood and creating a new df called munich_venues:
munich_venues = getNearbyVenues(names=result_df['Neighborhood'],
                                   latitudes=result_df['Latitude'],
                                   longitudes=result_df['Longitude'], 
                                   radius = 800 
                                  )
print(munich_venues.shape)

#Merging new retrieved data with existing df info of neighborhoods
left = result_df
right = munich_venues
neighborhood_venues = pd.merge(left, right, on='Neighborhood')

#drop columns Latitude and Longitude since they are duplicates now
neighborhood_venues = neighborhood_venues.drop(columns=['Latitude', 'Longitude'])
neighborhood_venues.head()


(1024, 7)


Unnamed: 0,Neighborhood,Inhabitants,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1 Ramersdorf - Perlach,108.244,48.103607,11.633565,Der Hufnagel,48.101297,11.628676,German Restaurant
1,1 Ramersdorf - Perlach,108.244,48.103607,11.633565,Pfanzeltplatz,48.100657,11.63047,Plaza
2,1 Ramersdorf - Perlach,108.244,48.103607,11.633565,Ana's Feinkost,48.100622,11.63038,Diner
3,1 Ramersdorf - Perlach,108.244,48.103607,11.633565,Lidl,48.106352,11.635778,Supermarket
4,1 Ramersdorf - Perlach,108.244,48.103607,11.633565,Roma,48.100706,11.625217,Italian Restaurant
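
`getNearbyVenues` assumes that every response contains `response -> groups[0] -> items` and that every venue has at least one category; a missing field raises an exception and aborts the whole run. The following is a minimal sketch, not the notebook's implementation, of building the query string with `urllib.parse.urlencode` and extracting the fields defensively. The function names are hypothetical, and the sample payload is a hand-written stand-in that only mirrors the fields accessed above (the "Lidl" row is taken from the output table, the second item is invented to show the skip path):

```python
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_URL + '?' + urlencode(params)


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, skipping incomplete items."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        if 'name' not in venue or not categories:
            continue  # no usable name or category: skip instead of raising
        rows.append((venue['name'], location.get('lat'), location.get('lng'),
                     categories[0]['name']))
    return rows


sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Lidl',
               'location': {'lat': 48.106352, 'lng': 11.635778},
               'categories': [{'name': 'Supermarket'}]}},
    {'venue': {'name': 'Uncategorized place', 'location': {}, 'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({}))  # an empty or malformed payload yields no rows
```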


#### Analyzing the venues

In [163]:
# Checking how many venues were returned for each neighborhood
print(neighborhood_venues.groupby('Neighborhood').count())

# Checking how many unique venue categories exist in order to decide whether it makes sense to convert these categorical values to numerical ones (one-hot encoding)
print('There are {} unique categories.'.format(len(neighborhood_venues['Venue Category'].unique())))


                                                    Inhabitants  \
Neighborhood                                                      
1 Ramersdorf - Perlach                                       23   
10 Au - Haidhausen                                          100   
11 Feldmoching - Hasenbergl                                   1   
12 Sendling - Westpark                                       20   
13 Laim                                                      43   
14 Untergiesing - Harlaching                                 29   
15 Maxvorstadt                                              100   
16 Moosach                                                   29   
17 Obergiesing - Fasanengarten                               10   
18 Ludwigsvorstadt - Isarvorstadt                           100   
19 Hadern                                                    16   
2 Neuhausen - Nymphenburg                                    42   
20 Berg am Laim                                              1

With 191 unique categories, one-hot encoding is feasible, so the categorical data is converted to numerical data with the goal of applying k-means clustering afterwards (section 3.5).  

## 3.4 Inhabitants/Bakery calculation 

In [164]:
#reduce df to only bakery venues
df_bakery = neighborhood_venues[neighborhood_venues['Venue Category'] == 'Bakery']

#count the occurrences for each neighborhood
number_bakeries = pd.DataFrame(df_bakery['Neighborhood'].value_counts())

# change the index, so that neighborhood is a column again
number_bakeries.reset_index(level=0, inplace=True)

#Rename the columns to Neighborhood and NumberBakeries
number_bakeries.columns = ['Neighborhood','NumberBakeries']

#merging the inhabitant data with the bakery numbers (outer join, so neighborhoods without a listed bakery are kept)
left = number_bakeries
right = df_munich
inh_bak_merged = pd.merge(left, right, how='outer', on='Neighborhood')

inh_bak_merged

|  | Neighborhood | NumberBakeries | Inhabitants |
|---|---|---|---|
| 0 | 22 Sendling | 5.0 | 39.953 |
| 1 | 16 Moosach | 4.0 | 51.537 |
| 2 | 13 Laim | 3.0 | 54.03 |
| 3 | 10 Au - Haidhausen | 3.0 | 59.752 |
| 4 | 24 Schwanthalerhöhe | 3.0 | 29.663 |
| 5 | 2 Neuhausen - Nymphenburg | 2.0 | 95.906 |
| 6 | 15 Maxvorstadt | 2.0 | 51.642 |
| 7 | 20 Berg am Laim | 2.0 | 43.068 |
| 8 | 18 Ludwigsvorstadt - Isarvorstadt | 2.0 | 50.62 |
| 9 | 8 Trudering - Riem | 2.0 | 67.009 |


In [165]:
# convert columns 'Inhabitants' and 'NumberBakeries' to numeric for calculation
inh_bak_merged[['Inhabitants', 'NumberBakeries']] = inh_bak_merged[['Inhabitants', 'NumberBakeries']].apply(pd.to_numeric)

# create a new column "inh_per_bakery": the number of inhabitants divided by the number of bakeries per neighborhood
inh_bak_merged['inh_per_bakery'] = inh_bak_merged['Inhabitants'] / inh_bak_merged['NumberBakeries']

# neighborhoods without a listed bakery have NaN here; replace it with the inhabitant value of that row (i.e. treat them as if they had one bakery)
inh_bak_merged['inh_per_bakery'] = inh_bak_merged['inh_per_bakery'].fillna(inh_bak_merged['Inhabitants'])

# sort by inhabitants per bakery in descending order, so that the neighborhoods with the fewest bakeries per inhabitant are shown on top 
inh_bak_merged.sort_values(by=['inh_per_bakery'], inplace=True, ascending=False)

inh_bak_merged

|  | Neighborhood | NumberBakeries | Inhabitants | inh_per_bakery |
|---|---|---|---|---|
| 18 | 3 Thalkirchen - Obersendling - Forstenried - F... | NaN | 90.79 | 90.79 |
| 12 | 4 Bogenhausen | 1.0 | 82.138 | 82.138 |
| 15 | 5 Milbertshofen - Am Hart | 1.0 | 73.617 | 73.617 |
| 19 | 7 Schwabing - Freimann | NaN | 69.676 | 69.676 |
| 20 | 9 Schwabing West | NaN | 65.892 | 65.892 |
| 21 | 11 Feldmoching - Hasenbergl | NaN | 59.391 | 59.391 |
| 16 | 12 Sendling - Westpark | 1.0 | 55.405 | 55.405 |
| 11 | 1 Ramersdorf - Perlach | 2.0 | 108.244 | 54.122 |
| 17 | 14 Untergiesing - Harlaching | 1.0 | 51.937 | 51.937 |
| 22 | 17 Obergiesing - Fasanengarten | NaN | 51.499 | 51.499 |
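
The ratio above divides the inhabitants by the number of listed bakeries and falls back to the full population when no bakery is listed. A minimal sketch of the same rule in plain Python, using three rows from the table (inhabitants in thousands); the function name `inhabitants_per_bakery` is hypothetical:

```python
def inhabitants_per_bakery(inhabitants, bakeries):
    """Inhabitants divided by bakery count; with no bakery, the full population.

    The fallback mirrors the fillna step: a neighborhood without a listed
    bakery is scored as if it had exactly one.
    """
    if not bakeries:  # None or 0
        return inhabitants
    return inhabitants / bakeries


# (neighborhood, inhabitants in thousands, listed bakeries or None)
rows = [
    ('1 Ramersdorf - Perlach', 108.244, 2),
    ('4 Bogenhausen', 82.138, 1),
    ('7 Schwabing - Freimann', 69.676, None),
]

ranked = sorted(
    ((name, inhabitants_per_bakery(inh, n)) for name, inh, n in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, ratio in ranked:
    print(name, ratio)
```

One consequence of this fallback is that a neighborhood with no listed bakery ranks the same as an equally populated neighborhood with exactly one, which is worth keeping in mind when reading the ranking.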


## 3.5 Clustering and evaluating the venue data

In [166]:
# Changing categorical data of "venue category" to numerical for clustering

# one hot encoding
neighborhood_onehot = pd.get_dummies(neighborhood_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column to new onehot df
neighborhood_onehot['Neighborhood'] = neighborhood_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [neighborhood_onehot.columns[-1]] + list(neighborhood_onehot.columns[:-1])
neighborhood_onehot = neighborhood_onehot[fixed_columns]

neighborhood_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,American Restaurant,Aquarium,Arcade,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,...,Tram Station,Trattoria/Osteria,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Xinjiang Restaurant,Zoo Exhibit
0,1 Ramersdorf - Perlach,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1 Ramersdorf - Perlach,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1 Ramersdorf - Perlach,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1 Ramersdorf - Perlach,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1 Ramersdorf - Perlach,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [167]:
neighborhood_onehot.shape

(1024, 193)

### Grouping rows by neighborhood and taking the mean of the occurrence frequency of each category

In [168]:
neighborhood_grouped = neighborhood_onehot.groupby('Neighborhood').mean().reset_index()
neighborhood_grouped

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,American Restaurant,Aquarium,Arcade,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,...,Tram Station,Trattoria/Osteria,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Xinjiang Restaurant,Zoo Exhibit
0,1 Ramersdorf - Perlach,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,10 Au - Haidhausen,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,...,0.0,0.01,0.0,0.02,0.02,0.0,0.01,0.0,0.0,0.0
2,11 Feldmoching - Hasenbergl,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,12 Sendling - Westpark,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13 Laim,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.0,...,0.046512,0.023256,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,0.0
5,14 Untergiesing - Harlaching,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,...,0.068966,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.172414
6,15 Maxvorstadt,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.01,0.0,...,0.0,0.01,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.0
7,16 Moosach,0.0,0.0,0.034483,0.0,0.034483,0.0,0.0,0.0,0.0,...,0.068966,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,17 Obergiesing - Fasanengarten,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,18 Ludwigsvorstadt - Isarvorstadt,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.03,0.0,...,0.0,0.0,0.0,0.0,0.01,0.04,0.01,0.01,0.01,0.0
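
Taking the mean of the one-hot columns per neighborhood yields the share of each category among that neighborhood's venues. For example, 13 Laim returned 43 venues, so a category with a single venue appears as 1/43 ≈ 0.023256. A minimal sketch of the same computation with `collections.Counter`, using made-up venue data (the function name `category_shares` and the neighborhoods "A" and "B" are illustrative only):

```python
from collections import Counter, defaultdict


def category_shares(venues):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return shares


venues = [
    ('A', 'Bakery'), ('A', 'Supermarket'), ('A', 'Supermarket'), ('A', 'Plaza'),
    ('B', 'Café'), ('B', 'Bar'),
]
shares = category_shares(venues)
print(shares['A'])  # {'Bakery': 0.25, 'Supermarket': 0.5, 'Plaza': 0.25}
print(shares['B'])  # {'Café': 0.5, 'Bar': 0.5}
```

Unlike the DataFrame version, this sketch omits categories with a share of zero instead of storing them explicitly.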


### Creating a new dataframe that displays the top 10 venues for each neighborhood

In [169]:
#function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [170]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhood_venues_sorted = pd.DataFrame(columns=columns)
neighborhood_venues_sorted['Neighborhood'] = neighborhood_grouped['Neighborhood']

# fill df with most common venues
for ind in np.arange(neighborhood_grouped.shape[0]):
    neighborhood_venues_sorted.iloc[ind, 1:] = return_most_common_venues(neighborhood_grouped.iloc[ind, :], num_top_venues)

neighborhood_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1 Ramersdorf - Perlach,Supermarket,German Restaurant,Bakery,Hotel,Bus Stop,Italian Restaurant,Ice Cream Shop,Plaza,Bus Line,Market
1,10 Au - Haidhausen,Italian Restaurant,Café,German Restaurant,Plaza,French Restaurant,Bakery,Ice Cream Shop,Beach,Bar,Gourmet Shop
2,11 Feldmoching - Hasenbergl,Lake,Zoo Exhibit,Event Service,Food & Drink Shop,Food,Flower Shop,Fish Market,Field,Fast Food Restaurant,Farmers Market
3,12 Sendling - Westpark,Bus Stop,Supermarket,Greek Restaurant,Tunnel,Ice Cream Shop,Brewery,Metro Station,Liquor Store,German Restaurant,Coffee Shop
4,13 Laim,Supermarket,Bakery,Bus Stop,Gastropub,Restaurant,Greek Restaurant,Tram Station,Bank,Plaza,Hotel
5,14 Untergiesing - Harlaching,Zoo Exhibit,Soccer Field,Tram Station,Sports Club,Supermarket,German Restaurant,Bus Stop,Café,Lawyer,Beer Garden
6,15 Maxvorstadt,Café,Bar,Italian Restaurant,Vietnamese Restaurant,Art Museum,Burger Joint,Mediterranean Restaurant,Plaza,Pizza Place,French Restaurant
7,16 Moosach,Supermarket,Bakery,Plaza,Tram Station,Drugstore,Hotel,Metro Station,Light Rail Station,Lawyer,Gastropub
8,17 Obergiesing - Fasanengarten,Hotel,Supermarket,Toy / Game Store,Office,Museum,Bus Stop,German Restaurant,Pie Shop,Gym,Ethiopian Restaurant
9,18 Ludwigsvorstadt - Isarvorstadt,Café,Italian Restaurant,German Restaurant,Burger Joint,Vietnamese Restaurant,Hotel,Ice Cream Shop,Asian Restaurant,Bar,Greek Restaurant
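
The column labels above rely on a three-element suffix list with a fallback to "th", which is correct for the top 10 but would produce labels such as "21th" for longer rankings. A minimal sketch of a general ordinal helper, should more columns ever be needed (the function name `ordinal` is hypothetical):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns[1], '|', columns[4], '|', columns[10])
```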


### Clustering the Neighborhoods into 5 clusters with *k*-means based on their venue categories

Clustering was chosen because it is a good way to segment a collection of data points into smaller groups with similar attributes. This way the information is aggregated and can be displayed in a management-ready format. While it would not be realistic in most cases to have a different marketing strategy for each data point, it is relatively common to have a different approach for each segment. It is therefore important to know how to differentiate your customers. 

In [171]:
# set number of clusters
k = 5

neighborhood_grouped_clustering = neighborhood_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(neighborhood_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:25] 

array([2, 0, 1, 2, 2, 0, 0, 2, 2, 0, 2, 0, 2, 4, 0, 2, 0, 0, 2, 2, 2, 0,
       3, 0, 0], dtype=int32)

### Creating a new dataframe that includes the clusters, the top 10 venues and the Geodata

In [172]:
# add clustering labels
neighborhood_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# getting latitude/longitude for each neighborhood
neighborhood_geodata = neighborhood_venues.groupby('Neighborhood')[['Neighborhood Latitude','Neighborhood Longitude']].mean()

#merging the cluster df with the geodata df
left = neighborhood_venues_sorted
right = neighborhood_geodata
neighborhood_merged = pd.merge(left, right, on='Neighborhood')

neighborhood_merged


Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Neighborhood Latitude,Neighborhood Longitude
0,2,1 Ramersdorf - Perlach,Supermarket,German Restaurant,Bakery,Hotel,Bus Stop,Italian Restaurant,Ice Cream Shop,Plaza,Bus Line,Market,48.103607,11.633565
1,0,10 Au - Haidhausen,Italian Restaurant,Café,German Restaurant,Plaza,French Restaurant,Bakery,Ice Cream Shop,Beach,Bar,Gourmet Shop,48.128592,11.593926
2,1,11 Feldmoching - Hasenbergl,Lake,Zoo Exhibit,Event Service,Food & Drink Shop,Food,Flower Shop,Fish Market,Field,Fast Food Restaurant,Farmers Market,48.211504,11.513181
3,2,12 Sendling - Westpark,Bus Stop,Supermarket,Greek Restaurant,Tunnel,Ice Cream Shop,Brewery,Metro Station,Liquor Store,German Restaurant,Coffee Shop,48.11519,11.519808
4,2,13 Laim,Supermarket,Bakery,Bus Stop,Gastropub,Restaurant,Greek Restaurant,Tram Station,Bank,Plaza,Hotel,48.137068,11.502451
5,0,14 Untergiesing - Harlaching,Zoo Exhibit,Soccer Field,Tram Station,Sports Club,Supermarket,German Restaurant,Bus Stop,Café,Lawyer,Beer Garden,48.100404,11.566378
6,0,15 Maxvorstadt,Café,Bar,Italian Restaurant,Vietnamese Restaurant,Art Museum,Burger Joint,Mediterranean Restaurant,Plaza,Pizza Place,French Restaurant,48.149976,11.573622
7,2,16 Moosach,Supermarket,Bakery,Plaza,Tram Station,Drugstore,Hotel,Metro Station,Light Rail Station,Lawyer,Gastropub,48.181312,11.518036
8,2,17 Obergiesing - Fasanengarten,Hotel,Supermarket,Toy / Game Store,Office,Museum,Bus Stop,German Restaurant,Pie Shop,Gym,Ethiopian Restaurant,48.10167,11.592276
9,0,18 Ludwigsvorstadt - Isarvorstadt,Café,Italian Restaurant,German Restaurant,Burger Joint,Vietnamese Restaurant,Hotel,Ice Cream Shop,Asian Restaurant,Bar,Greek Restaurant,48.129431,11.55984


### Visualizing the resulting clusters

In [173]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(neighborhood_merged['Neighborhood Latitude'], neighborhood_merged['Neighborhood Longitude'], neighborhood_merged['Neighborhood'], neighborhood_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# 4. Results Section:

## 4.1 Insights from the bakery information:  

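The table below shows the inhabitants per bakery for each neighborhood, including the neighborhoods that have no bakery listed in Foursquare.  

The neighborhoods "Thalkirchen - Obersendling - Forstenried - F.",  "Bogenhausen" and "Milbertshofen - Am Hart" top the list with more than 70,000 inhabitants per bakery. Note that the Inhabitants column is given in thousands, and that "Thalkirchen - Obersendling - Forstenried - F." has no bakery listed at all, so its value is simply its population. 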

In [174]:
inh_bak_merged

|  | Neighborhood | NumberBakeries | Inhabitants | inh_per_bakery |
|---|---|---|---|---|
| 18 | 3 Thalkirchen - Obersendling - Forstenried - F... | NaN | 90.79 | 90.79 |
| 12 | 4 Bogenhausen | 1.0 | 82.138 | 82.138 |
| 15 | 5 Milbertshofen - Am Hart | 1.0 | 73.617 | 73.617 |
| 19 | 7 Schwabing - Freimann | NaN | 69.676 | 69.676 |
| 20 | 9 Schwabing West | NaN | 65.892 | 65.892 |
| 21 | 11 Feldmoching - Hasenbergl | NaN | 59.391 | 59.391 |
| 16 | 12 Sendling - Westpark | 1.0 | 55.405 | 55.405 |
| 11 | 1 Ramersdorf - Perlach | 2.0 | 108.244 | 54.122 |
| 17 | 14 Untergiesing - Harlaching | 1.0 | 51.937 | 51.937 |
| 22 | 17 Obergiesing - Fasanengarten | NaN | 51.499 | 51.499 |


## 4.2. Insights from the venue cluster information: 

The resulting clusters show that the venue structure differs with the centrality of the neighborhood: the most central neighborhoods (the best locations) fall into one cluster (red), while the "second best" neighborhoods form a ring-like cluster around it (blue). The 3 outermost neighborhoods are unlike the others in their venue mix and thus each form their own cluster.   
How does this information help the client?  
If the client decides to build a bakery in one of the red (central) neighborhoods, they can look at the most common venues there and adjust their product portfolio accordingly. The red neighborhoods, for example, list many more cafés, restaurants and bars than the other clusters. The outer clusters, on the other hand, list supermarkets and transportation facilities among their most common venues. If the client builds a store in the red areas, they should be aware that other food venues compete for customers' attention, so the client's products need a unique selling point (USP) over these places. If the client builds on the outskirts of the city, they could seek alliances with the supermarkets there or adapt their products to customers who seem to spend some time traveling into the city. Bakery goods that are easy to eat on the go might be a good offering there.

In [175]:
# neighborhoods cluster visualization
map_clusters

In [176]:
# neighborhoods, their venues and cluster label
neighborhood_merged

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Neighborhood Latitude,Neighborhood Longitude
0,2,1 Ramersdorf - Perlach,Supermarket,German Restaurant,Bakery,Hotel,Bus Stop,Italian Restaurant,Ice Cream Shop,Plaza,Bus Line,Market,48.103607,11.633565
1,0,10 Au - Haidhausen,Italian Restaurant,Café,German Restaurant,Plaza,French Restaurant,Bakery,Ice Cream Shop,Beach,Bar,Gourmet Shop,48.128592,11.593926
2,1,11 Feldmoching - Hasenbergl,Lake,Zoo Exhibit,Event Service,Food & Drink Shop,Food,Flower Shop,Fish Market,Field,Fast Food Restaurant,Farmers Market,48.211504,11.513181
3,2,12 Sendling - Westpark,Bus Stop,Supermarket,Greek Restaurant,Tunnel,Ice Cream Shop,Brewery,Metro Station,Liquor Store,German Restaurant,Coffee Shop,48.11519,11.519808
4,2,13 Laim,Supermarket,Bakery,Bus Stop,Gastropub,Restaurant,Greek Restaurant,Tram Station,Bank,Plaza,Hotel,48.137068,11.502451
5,0,14 Untergiesing - Harlaching,Zoo Exhibit,Soccer Field,Tram Station,Sports Club,Supermarket,German Restaurant,Bus Stop,Café,Lawyer,Beer Garden,48.100404,11.566378
6,0,15 Maxvorstadt,Café,Bar,Italian Restaurant,Vietnamese Restaurant,Art Museum,Burger Joint,Mediterranean Restaurant,Plaza,Pizza Place,French Restaurant,48.149976,11.573622
7,2,16 Moosach,Supermarket,Bakery,Plaza,Tram Station,Drugstore,Hotel,Metro Station,Light Rail Station,Lawyer,Gastropub,48.181312,11.518036
8,2,17 Obergiesing - Fasanengarten,Hotel,Supermarket,Toy / Game Store,Office,Museum,Bus Stop,German Restaurant,Pie Shop,Gym,Ethiopian Restaurant,48.10167,11.592276
9,0,18 Ludwigsvorstadt - Isarvorstadt,Café,Italian Restaurant,German Restaurant,Burger Joint,Vietnamese Restaurant,Hotel,Ice Cream Shop,Asian Restaurant,Bar,Greek Restaurant,48.129431,11.55984


# 5. Discussion Section:

### Foursquare Database 
The venues were retrieved from Foursquare, which has a vast collection of venues worldwide but is of course not complete. The venues and calculations in this exercise therefore give only a rough direction for a decision and should not be considered precise. Living in Munich myself, I am pretty sure there is more than one bakery in Bogenhausen. The above calculations should be seen as an example of how to approach such a use case; for a real client, another venue data service should be used instead of, or in addition to, Foursquare.  


### Neighborhood Area
One assumption of the calculation is that the neighborhoods of Munich are circular and all have the same radius r. Of course this is not the case, but since even retrieving the centroid geodata for each neighborhood was difficult, it was not possible to model the neighborhood boundaries more realistically. Because of this assumption, some venues of a neighborhood might not have been considered because they lie outside the specified r around the centroid, and other venues might have been counted for more than one neighborhood because they lie within r of more than one centroid.  
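
One way to reduce double counting, which was not done in this analysis, would be to assign each venue only to its nearest centroid. A minimal sketch using the haversine great-circle distance; the two centroids are taken from the tables above, while the venue coordinates and the function names are made up for illustration:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in meters."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def nearest_centroid(venue, centroids):
    """Return the name of the centroid closest to venue = (lat, lng)."""
    return min(centroids, key=lambda name: haversine_m(*venue, *centroids[name]))


centroids = {
    '10 Au - Haidhausen': (48.128592, 11.593926),
    '18 Ludwigsvorstadt - Isarvorstadt': (48.129431, 11.55984),
}
venue = (48.1290, 11.5800)  # hypothetical venue between the two centroids

print(nearest_centroid(venue, centroids))
print(round(haversine_m(*centroids['10 Au - Haidhausen'],
                        *centroids['18 Ludwigsvorstadt - Isarvorstadt'])))
```

The same distance function can also be used to check whether a venue lies within r of a centroid. Nearest-centroid assignment is still only an approximation of the real neighborhood boundaries.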


### Clustering
The number k of clusters was chosen arbitrarily, so this might not be the best segmentation of the neighborhoods.   
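
A common heuristic for choosing k, not applied in this analysis, is to compare the within-cluster sum of squared distances (inertia) for several values of k and look for the point where the improvement levels off (the "elbow"). The sketch below is a plain-Python version of Lloyd's algorithm on made-up 2D points, meant only to illustrate the idea; with scikit-learn the same quantity is available as the `inertia_` attribute of a fitted `KMeans` object.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, iterations=100):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    inertia = sum(
        squared_distance(p, centroids[label]) for p, label in zip(points, labels)
    )
    return labels, centroids, inertia


# Made-up points forming three loose groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

for k in range(1, 6):
    _, _, inertia = kmeans(points, k)
    print(k, round(inertia, 2))
```

Because the result depends on the random initialization, a fuller comparison would run several seeds per k and keep the lowest inertia.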


# 6. Conclusion Section:

In conclusion, based on both calculation sections, I would recommend that the client build their stores in the neighborhoods
"Thalkirchen - Obersendling - Forstenried - F.",  "Bogenhausen" and "Milbertshofen - Am Hart", which all have over 70000 inhabitants per bakery.  

Concerning the clustering results for these 3 neighborhoods: they are all in Cluster 2. Cluster 2 can be characterized as the second ring (around the core of Munich), with mainly supermarkets, hotels and public transport venues.  
Thus it seems that the best locations for the bakery stores are the neighborhoods that lie around the core of Munich: not the city center itself, but also not the most remote places on the outskirts. Competition is not too strong here, and there are options to form alliances with the existing supermarkets and/or hotels in the areas.  
By adjusting the product portfolio a little to the needs of the customers in these areas, I am optimistic that the investment will be worthwhile!