# Capstone Project - The Battle of the Neighborhoods 
## Data Science Professional Certificate by IBM/Coursera
### Author: Sergio P.C. 

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction <a name="introduction"></a>

This project analyzes the possibility of opening a new restaurant that serves authentic Mexican food in Frankfurt. The city is often considered to have the most Mexican expats in Germany, and it already has a fair number of Mexican restaurants, though most of them serve modified recipes that lose the original taste.

Therefore, the best location will be chosen by taking the following points into consideration, in order of importance:

- Existence of other Mexican restaurants
- Existence of other restaurants
- Proximity to key areas: metro stations, plazas, downtown

Data acquisition and initial cleaning will be performed using Python and its libraries (Pandas, NumPy, SciPy). Subsequently, an unsupervised machine learning algorithm will be used to cluster the neighborhoods based on the above criteria, and the clusters will then be evaluated to recommend a location for the Mexican restaurant to stakeholders. 

## Data <a name="data"></a>

Based on the criteria given above, the primary factors that will aid in choosing a location are:
* Existing restaurants in a given neighborhood
* Existing Mexican restaurants
* Distance to key establishments: metro stations, plazas and downtown
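The distance criterion can be made concrete with a great-circle distance between two coordinate pairs. The following is a minimal sketch using the haversine formula; the function name `haversine_m` is an assumption made for illustration and is not part of the notebook, and the two coordinate pairs are the values reported for the Altstadt center and the Römerberg plaza in the tables of this notebook.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Neighborhood center (Altstadt) and a nearby plaza (Römerberg)
center = (50.110442, 8.682901)
plaza = (50.110489, 8.682131)
distance = haversine_m(*center, *plaza)
print(round(distance), "m")
```

A distance like this could be compared against the search radius used for the venue queries to judge how close a candidate spot is to a key establishment.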

Neighborhoods (Stadtteile) will be defined as listed on Wikipedia, and only the main neighborhoods close to the city center (Innenstadt, Altstadt) will be considered. 

The data sources and tools used will be:
* Wikipedia for the neighborhood names
* The **Foursquare API** for restaurant locations and types
* Folium for map visuals


### Neighborhoods (Stadtteile)

The names of the neighborhoods will be scraped from the Wikipedia page using the `wikipedia` package for Python. Their coordinates will then be obtained by geocoding each name. 

In [20]:
# Initial necessary libraries
import numpy as np
import pandas as pd  # needed for the display options and DataFrames below
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import wikipedia as wp

In [2]:
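from io import StringIO

wp.set_lang("de") # Language set to German, since the English site does not have the Stadtteil list
frank_page = 'Liste_der_Stadtteile_von_Frankfurt_am_Main'
html = wp.page(frank_page).html()
# Wrapping the HTML in StringIO avoids the deprecation warning newer pandas versions raise for literal HTML strings
frank_stdteile = pd.read_html(StringIO(html))[0] # Neighborhoods table imported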

In [3]:
frank_stdteile

Unnamed: 0,Nr.,Stadtteil,Fläche[3]in km²,Einwohner,Weiblich,Männlich,Deutsche,Ausländer,Ausländerin Prozent,Einwohnerje km²,Ortsbezirk,Stadtgebietseit,Vorherige Zugehörigkeit
0,1,Altstadt,,,,,,,373,8204,01 Innenstadt I,1866[Anm. 1],Freie Stadt Frankfurt
1,2,Innenstadt,,,,,,,468,4430,01 Innenstadt I,1866[Anm. 2],Freie Stadt Frankfurt
2,3,Bahnhofsviertel,,,,,,,54,6570,01 Innenstadt I,1866[Anm. 3],Freie Stadt Frankfurt
3,4,Westend-Süd,,,,,,,275,7538,02 Innenstadt II,1866[Anm. 3],Freie Stadt Frankfurt
4,5,Westend-Nord,,,,,,,293,6249,02 Innenstadt II,1866[Anm. 3],Freie Stadt Frankfurt
5,6,Nordend-West,,,,,,,222,9845,03 Innenstadt III,1866[Anm. 3],Freie Stadt Frankfurt
6,7,Nordend-Ost,,,,,,,224,15031,03 Innenstadt III,1866[Anm. 3],Freie Stadt Frankfurt
7,8,Ostend,,,,,,,287,5243,04 Bornheim/Ostend,1866[Anm. 3],Freie Stadt Frankfurt
8,9,Bornheim,,,,,,,24,10959,04 Bornheim/Ostend,1877,Stadtkreis Frankfurt am Main[Anm. 4]
9,10,Gutleutviertel,,,,,,,426,3864,01 Innenstadt I,1866[Anm. 3],Freie Stadt Frankfurt


Cleaning up the table. 

In [4]:
neighs = frank_stdteile[['Stadtteil']]

In [5]:
neighs

Unnamed: 0,Stadtteil
0,Altstadt
1,Innenstadt
2,Bahnhofsviertel
3,Westend-Süd
4,Westend-Nord
5,Nordend-West
6,Nordend-Ost
7,Ostend
8,Bornheim
9,Gutleutviertel


In [6]:
neighs = neighs[:26]

In [7]:
neighs

Unnamed: 0,Stadtteil
0,Altstadt
1,Innenstadt
2,Bahnhofsviertel
3,Westend-Süd
4,Westend-Nord
5,Nordend-West
6,Nordend-Ost
7,Ostend
8,Bornheim
9,Gutleutviertel


In [9]:
# Using geocoders the latitudes and longitudes will be acquired
from geopy.geocoders import Nominatim

In [10]:
latitudes = []
longitudes = []

geolocator = Nominatim(user_agent="frankfurt_explorer")

for teil in neighs['Stadtteil']:
    address = f'{teil}, Frankfurt am Main, Germany'    
    location = geolocator.geocode(address)
    latitudes.append(location.latitude)
    longitudes.append(location.longitude)
    print('Got coordinates of {}.'.format(teil))

Got coordinates of Altstadt.
Got coordinates of Innenstadt.
Got coordinates of Bahnhofsviertel.
Got coordinates of Westend-Süd.
Got coordinates of Westend-Nord.
Got coordinates of Nordend-West.
Got coordinates of Nordend-Ost.
Got coordinates of Ostend.
Got coordinates of Bornheim.
Got coordinates of Gutleutviertel.
Got coordinates of Gallus.
Got coordinates of Bockenheim.
Got coordinates of Sachsenhausen-Nord.
Got coordinates of Sachsenhausen-Süd.
Got coordinates of Flughafen.
Got coordinates of Oberrad.
Got coordinates of Niederrad.
Got coordinates of Schwanheim.
Got coordinates of Griesheim.
Got coordinates of Rödelheim.
Got coordinates of Hausen.
Got coordinates of Praunheim.
Got coordinates of Heddernheim.
Got coordinates of Niederursel.
Got coordinates of Ginnheim.
Got coordinates of Dornbusch.
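The geocoding loop above assumes that every lookup succeeds. With geopy, `geocode` returns `None` when no match is found, and the public Nominatim service asks clients to keep their request rate low. The following is a hedged sketch of a more defensive loop. The helper `geocode_all`, the `delay_s` parameter and the stand-in `fake_geocode` are assumptions introduced here so the example runs without network access; they are not part of the notebook.

```python
import time


def geocode_all(names, geocode, suffix="Frankfurt am Main, Germany", delay_s=1.0):
    """Return {name: (lat, lon) or None} using the supplied geocode callable."""
    coords = {}
    for name in names:
        location = geocode(f"{name}, {suffix}")
        # geocode() may return None when nothing matches, so guard before use
        if location is not None:
            coords[name] = (location.latitude, location.longitude)
        else:
            coords[name] = None
        time.sleep(delay_s)  # pause between requests to keep the request rate low
    return coords


# Stand-in geocoder (hypothetical) so the sketch runs without network access
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {"Altstadt, Frankfurt am Main, Germany": FakeLocation(50.110442, 8.682901)}
    return known.get(address)


result = geocode_all(["Altstadt", "Nowhere"], fake_geocode, delay_s=0)
missing = [name for name, xy in result.items() if xy is None]
print(result, missing)
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`, and any names listed in `missing` could be reviewed by hand before building `fra_df`.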


In [11]:
# DataFrame with the neighborhood name and latitude and longitude is created
fra_df = pd.DataFrame({'Neighborhood': neighs.Stadtteil, 'Latitude': latitudes, 'Longitude': longitudes })

In [12]:
fra_df

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Altstadt,50.110442,8.682901
1,Innenstadt,50.112993,8.674341
2,Bahnhofsviertel,50.107741,8.668676
3,Westend-Süd,50.115245,8.66227
4,Westend-Nord,50.126356,8.667921
5,Nordend-West,50.124914,8.67795
6,Nordend-Ost,50.12492,8.692317
7,Ostend,50.115935,8.720546
8,Bornheim,50.133056,8.714932
9,Gutleutviertel,50.097925,8.648964


In [13]:
# Folium is used to provide a map visual of the neighborhoods and later the neighborhood clusters. 
import folium

In [15]:
lat = 50.116667
long = 8.683333
map_frankfurt = folium.Map(location=[lat, long], zoom_start=12)

for lat, long, neighborhood in zip(fra_df['Latitude'], fra_df['Longitude'], fra_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color='blue',
                       fill=True, 
                       fill_color='#3186cc', 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(map_frankfurt)

map_frankfurt

### Foursquare API
The Foursquare API is now used to get venue listings. 

In [16]:
CLIENT_ID = 'Removed for security and privacy' 
CLIENT_SECRET = 'Removed for security and privacy'
VERSION = '20180605'

In [17]:
LIMIT = 100
radius = 500

In [18]:
# Creating a function to query venues for all neighborhoods using the Foursquare API
def getVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
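
`getVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` if a venue came back with an empty category list. The following is a small sketch of a more tolerant parsing step. The helper name `parse_items`, the `"Unknown"` placeholder and the hand-written `sample_items` are assumptions for illustration; the sample only mimics the fields accessed above and is not real API output.

```python
def parse_items(name, lat, lng, items):
    """Flatten 'explore' items into tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder when a venue has no category listed
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written sample shaped like the fields used in getVenues (not real API output)
sample_items = [
    {"venue": {"name": "Römerberg",
               "location": {"lat": 50.110489, "lng": 8.682131},
               "categories": [{"name": "Plaza"}]}},
    {"venue": {"name": "No Category Venue",
               "location": {"lat": 50.11, "lng": 8.68},
               "categories": []}},
]
rows = parse_items("Altstadt", 50.110442, 8.682901, sample_items)
print(rows)
```

The tuples have the same seven fields as the rows built in `getVenues`, so they could feed the same `pd.DataFrame` construction.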

In [21]:
# Import libraries needed for interaction with the Foursquare API
import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [22]:
fra_venues = getVenues(names=fra_df['Neighborhood'],
                                   latitudes=fra_df['Latitude'],
                                   longitudes=fra_df['Longitude']
                                  )


Altstadt
Innenstadt
Bahnhofsviertel
Westend-Süd
Westend-Nord
Nordend-West
Nordend-Ost
Ostend
Bornheim
Gutleutviertel
Gallus
Bockenheim
Sachsenhausen-Nord
Sachsenhausen-Süd
Flughafen
Oberrad
Niederrad
Schwanheim
Griesheim
Rödelheim
Hausen
Praunheim
Heddernheim
Niederursel
Ginnheim
Dornbusch


Exploring the dataframe of the Frankfurt venues per neighborhood. 

In [23]:
fra_venues.shape


(624, 7)

In [24]:
fra_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Altstadt,50.110442,8.682901,Römerberg,50.110489,8.682131,Plaza
1,Altstadt,50.110442,8.682901,SCHIRN Kunsthalle,50.110291,8.683542,Art Museum
2,Altstadt,50.110442,8.682901,Dom Aussichtsplattform,50.110609,8.684908,Scenic Lookout
3,Altstadt,50.110442,8.682901,Weinterasse Rollanderhof,50.112473,8.682164,Wine Bar
4,Altstadt,50.110442,8.682901,Main,50.10839,8.682631,River


In [25]:
fra_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Altstadt,96,96,96,96,96,96
Bahnhofsviertel,100,100,100,100,100,100
Bockenheim,27,27,27,27,27,27
Bornheim,15,15,15,15,15,15
Dornbusch,5,5,5,5,5,5
Flughafen,16,16,16,16,16,16
Gallus,23,23,23,23,23,23
Ginnheim,10,10,10,10,10,10
Griesheim,8,8,8,8,8,8
Gutleutviertel,13,13,13,13,13,13


In [26]:
print('There are {} unique categories.'.format(len(fra_venues['Venue Category'].unique())))

There are 159 unique categories.


The list of venue categories below gives a sense of the restaurant categories available, as well as of other venues of interest to us. Venues of interest are mainly those that are heavily transited. 

In [27]:
fra_venues.groupby('Venue Category').count()

Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
African Restaurant,1,1,1,1,1,1
Airport Lounge,2,2,2,2,2,2
Airport Service,2,2,2,2,2,2
American Restaurant,4,4,4,4,4,4
Apple Wine Pub,1,1,1,1,1,1
Art Museum,8,8,8,8,8,8
Asian Restaurant,9,9,9,9,9,9
Athletics & Sports,1,1,1,1,1,1
Austrian Restaurant,1,1,1,1,1,1
BBQ Joint,2,2,2,2,2,2


In [29]:
# Saving original fra_venues DataFrame to CSV in case use is needed later without having to query Foursquare API again

# fra_venues.to_csv(r'C:\Users\sergi\Desktop\IBM_DATA_SCIENCE\Capstone\fra_venues.csv', index = False, header=True)


In [30]:
# Extract Mexican Restaurants

mex_rest = fra_venues[fra_venues['Venue Category']=='Mexican Restaurant']

In [31]:
mex_rest

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
44,Altstadt,50.110442,8.682901,Tequila Cantina Y Bar,50.111681,8.678587,Mexican Restaurant
155,Innenstadt,50.112993,8.674341,Tequila Cantina Y Bar,50.111681,8.678587,Mexican Restaurant
223,Bahnhofsviertel,50.107741,8.668676,La Mex Lounge,50.108337,8.667939,Mexican Restaurant
270,Bahnhofsviertel,50.107741,8.668676,Yumas,50.10394,8.665302,Mexican Restaurant
360,Nordend-Ost,50.12492,8.692317,Fonda De Santiago,50.12533,8.697207,Mexican Restaurant


In [35]:
# Extract All Restaurants
restaurants = fra_venues[fra_venues['Venue Category'].str.contains(u'Restaurant')]

In [60]:
restaurants.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
11,Altstadt,50.110442,8.682901,Paulaner am Dom,50.110876,8.685925,German Restaurant
12,Altstadt,50.110442,8.682901,Fisch Franke,50.112252,8.684247,Seafood Restaurant
19,Altstadt,50.110442,8.682901,Superkato,50.111664,8.679153,Sushi Restaurant
21,Altstadt,50.110442,8.682901,Góc Phố,50.113509,8.681686,Vietnamese Restaurant
23,Altstadt,50.110442,8.682901,Heimat – Essen und Weine,50.111125,8.678286,German Restaurant


In [37]:
rest_minimized = restaurants[restaurants['Venue Category'] != 'Mexican Restaurant' ]

In [61]:
rest_minimized.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
11,Altstadt,50.110442,8.682901,Paulaner am Dom,50.110876,8.685925,German Restaurant
12,Altstadt,50.110442,8.682901,Fisch Franke,50.112252,8.684247,Seafood Restaurant
19,Altstadt,50.110442,8.682901,Superkato,50.111664,8.679153,Sushi Restaurant
21,Altstadt,50.110442,8.682901,Góc Phố,50.113509,8.681686,Vietnamese Restaurant
23,Altstadt,50.110442,8.682901,Heimat – Essen und Weine,50.111125,8.678286,German Restaurant


In [74]:
# Venue Categories reduced from the specific type of restaurant to just Restaurant
# To avoid the sparsity of the data for analysis
rest_mini = rest_minimized.copy()
rest_mini['Venue Category'] = 'Restaurant'

In [75]:
rest_mini

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
11,Altstadt,50.110442,8.682901,Paulaner am Dom,50.110876,8.685925,Restaurant
12,Altstadt,50.110442,8.682901,Fisch Franke,50.112252,8.684247,Restaurant
19,Altstadt,50.110442,8.682901,Superkato,50.111664,8.679153,Restaurant
21,Altstadt,50.110442,8.682901,Góc Phố,50.113509,8.681686,Restaurant
23,Altstadt,50.110442,8.682901,Heimat – Essen und Weine,50.111125,8.678286,Restaurant
27,Altstadt,50.110442,8.682901,Restaurant Medici,50.11175,8.678993,Restaurant
39,Altstadt,50.110442,8.682901,Picknickbank,50.111641,8.678415,Restaurant
40,Altstadt,50.110442,8.682901,Walden,50.111667,8.67825,Restaurant
42,Altstadt,50.110442,8.682901,Bistro Da Salvatore,50.109556,8.687952,Restaurant
45,Altstadt,50.110442,8.682901,Römer Pils Brunnen,50.11345,8.683873,Restaurant


### Preliminary mapping

Now, using Folium, a visual is created to show the locations of the Mexican restaurants as well as of all other restaurants. 

In [39]:
# Visualize Mexican restaurants
for lat, long, neighborhood in zip(mex_rest['Venue Latitude'], mex_rest['Venue Longitude'], mex_rest['Venue']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color='red',
                       fill=True, 
                       fill_color='red', 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(map_frankfurt)

In [40]:
map_frankfurt

In [41]:
# Visualize the rest of the restaurants
for lat, long, neighborhood in zip(rest_minimized['Venue Latitude'], rest_minimized['Venue Longitude'], rest_minimized['Venue']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color='green',
                       fill=True, 
                       fill_color='green', 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(map_frankfurt)

In [42]:
map_frankfurt

Taking the above maps as a reference for where the existing Mexican restaurants are located, as well as for the relative overall restaurant density across the chosen neighborhoods, the next step is to locate places of interest.

The places of interest are defined as venue categories identified by Foursquare where there are normally large crowds and/or many people in transit, such as: 

* Bus Stop
* Metro Station
* Hotel
* Park
* Plaza
* Theater

A dataframe denoting these places as well as the restaurants will be created in order to identify the different clusters using an unsupervised learning algorithm. 

In [45]:
# Reduce dataframe to include restaurants, metro/bus stations, parks, plazas

venues_interest = ['Bus Stop', 'Metro Station', 'Hotel', 'Park', 'Plaza', 'Theater']

reduced_venues = fra_venues[fra_venues['Venue Category'].isin(venues_interest)]

In [59]:
reduced_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Altstadt,50.110442,8.682901,Römerberg,50.110489,8.682131,Plaza
17,Altstadt,50.110442,8.682901,Paulsplatz,50.11115,8.681589,Plaza
24,Altstadt,50.110442,8.682901,Liebfrauenberg,50.112654,8.681372,Plaza
57,Altstadt,50.110442,8.682901,Hotel Motel One Frankfurt-Römer,50.110259,8.678508,Hotel
64,Altstadt,50.110442,8.682901,Goetheplatz,50.112584,8.676767,Plaza


In [76]:
final_venues = pd.concat([reduced_venues, rest_mini, mex_rest])

In [77]:
final_venues.sort_values(by=['Neighborhood'], inplace=True)

In [78]:
final_venues.reset_index(drop=True)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Altstadt,50.110442,8.682901,Römerberg,50.110489,8.682131,Plaza
1,Altstadt,50.110442,8.682901,Weidenhof,50.114425,8.682191,Restaurant
2,Altstadt,50.110442,8.682901,Thai-Express,50.114186,8.683732,Restaurant
3,Altstadt,50.110442,8.682901,Zeil-Kitchen,50.114737,8.684614,Restaurant
4,Altstadt,50.110442,8.682901,Sky Lounge | Galeria Kaufhof,50.114282,8.679814,Restaurant
5,Altstadt,50.110442,8.682901,Klosterhof,50.109259,8.677224,Restaurant
6,Altstadt,50.110442,8.682901,Ramen Jun Red,50.112813,8.685973,Restaurant
7,Altstadt,50.110442,8.682901,Salzkammer,50.111557,8.678302,Restaurant
8,Altstadt,50.110442,8.682901,Restaurant China Garten,50.110308,8.678666,Restaurant
9,Altstadt,50.110442,8.682901,Centro Cultural Gallego,50.113678,8.687172,Restaurant


This concludes the data collection and data cleanup part. 
### First Peer-Graded Assignment End

The second part of the Capstone project is found below. 

## Methodology <a name="methodology"></a>

As a first step, the necessary data was gathered, cleaned up and sorted. A preliminary visual analysis was then carried out using **Folium** to generate a labeled map with the location of existing restaurants in each neighborhood. 

For the second step, heat maps will be generated to give a more compelling view of restaurant density in the different neighborhoods, particularly in comparison with the location of our venues of interest, to see where the most transit occurs. 

The final step will be to use **K-Means clustering** to identify areas with distinct venue characteristics and superimpose them on our heat maps, in order to choose a location that has venues of interest nearby and a low restaurant density. 


## Analysis <a name="analysis"></a>

The heat map generation function of Folium will be used to get a visual of the restaurant density. 
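Because each neighborhood is queried with its own search radius, a venue near a boundary can be returned for more than one neighborhood. Tequila Cantina Y Bar, for example, is listed under both Altstadt and Innenstadt in the Mexican restaurant table. A heat map built from such rows would weight that spot twice. The sketch below shows one possible way to drop such duplicates first, using a toy frame that copies the five Mexican restaurant rows; the variable names `venues`, `unique_venues` and `latlons` are assumptions for illustration.

```python
import pandas as pd

# Toy rows copied from the Mexican restaurant table; one venue appears twice
venues = pd.DataFrame({
    "Neighborhood": ["Altstadt", "Innenstadt", "Bahnhofsviertel",
                     "Bahnhofsviertel", "Nordend-Ost"],
    "Venue": ["Tequila Cantina Y Bar", "Tequila Cantina Y Bar", "La Mex Lounge",
              "Yumas", "Fonda De Santiago"],
    "Venue Latitude": [50.111681, 50.111681, 50.108337, 50.103940, 50.125330],
    "Venue Longitude": [8.678587, 8.678587, 8.667939, 8.665302, 8.697207],
})

# Keep one row per physical venue, identified by name and coordinates
unique_venues = venues.drop_duplicates(
    subset=["Venue", "Venue Latitude", "Venue Longitude"])
latlons = unique_venues[["Venue Latitude", "Venue Longitude"]].values.tolist()
print(len(venues), "rows ->", len(latlons), "unique points")
```

A list of `[latitude, longitude]` pairs such as `latlons` can be passed to `HeatMap` in place of the dataframe slices used below.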

In [66]:
restaurant_latlons = restaurants[['Venue Latitude', 'Venue Longitude']]

mexican_latlons = mex_rest[['Venue Latitude', 'Venue Longitude']]

In [70]:
from folium import plugins
from folium.plugins import HeatMap

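# Center the map explicitly on Frankfurt, since lat/long were overwritten by the marker loops above
heat_map_fra = folium.Map(location=[50.116667, 8.683333], zoom_start=12)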
folium.TileLayer('cartodbpositron').add_to(heat_map_fra) #cartodbpositron cartodbdark_matter
HeatMap(restaurant_latlons).add_to(heat_map_fra)
for lat, long, neighborhood in zip(fra_df['Latitude'], fra_df['Longitude'], fra_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color='blue',
                       fill=True, 
                       fill_color='#3186cc', 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(heat_map_fra)

heat_map_fra

Most restaurants seem to be around the **Bahnhofsviertel**. 

Now let's visualize where the **Mexican restaurants** are located.

In [71]:
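# Center the map explicitly on Frankfurt, since lat/long were overwritten by the marker loops above
heat_map_mex = folium.Map(location=[50.116667, 8.683333], zoom_start=12)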
folium.TileLayer('cartodbpositron').add_to(heat_map_mex) #cartodbpositron cartodbdark_matter
HeatMap(mexican_latlons).add_to(heat_map_mex)
for lat, long, neighborhood in zip(fra_df['Latitude'], fra_df['Longitude'], fra_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color='blue',
                       fill=True, 
                       fill_color='#3186cc', 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(heat_map_mex)

heat_map_mex

It seems that the existing **Mexican restaurants** are sparse and found in only four areas of Frankfurt. This looks very promising with respect to locating a new **Mexican restaurant**. 

When coupled with the previous **restaurant heat map**, the areas of **Innenstadt** and **Altstadt**, both of which are classified as downtown Frankfurt, seem to be quite free of restaurant **overcrowding**. 

Further analysis, now including the venues of interest, will be used to narrow down the options and select an optimum. 

In [79]:
# one hot encoding is needed for the ML model
fra_onehot = pd.get_dummies(final_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
fra_onehot['Neighborhood'] = final_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [fra_onehot.columns[-1]] + list(fra_onehot.columns[:-1])
fra_onehot = fra_onehot[fixed_columns]

fra_onehot.head()

Unnamed: 0,Neighborhood,Bus Stop,Hotel,Metro Station,Mexican Restaurant,Park,Plaza,Restaurant,Theater
0,Altstadt,0,0,0,0,0,1,0,0
82,Altstadt,0,0,0,0,0,0,1,0
78,Altstadt,0,0,0,0,0,0,1,0
73,Altstadt,0,0,0,0,0,0,1,0
63,Altstadt,0,0,0,0,0,0,1,0


In [80]:
fra_onehot.shape

(252, 9)

In [81]:
fra_grouped = fra_onehot.groupby('Neighborhood').mean().reset_index()
fra_grouped

Unnamed: 0,Neighborhood,Bus Stop,Hotel,Metro Station,Mexican Restaurant,Park,Plaza,Restaurant,Theater
0,Altstadt,0.0,0.037037,0.0,0.037037,0.0,0.222222,0.703704,0.0
1,Bahnhofsviertel,0.0,0.211538,0.0,0.038462,0.019231,0.019231,0.673077,0.038462
2,Bockenheim,0.1,0.1,0.0,0.0,0.0,0.1,0.7,0.0
3,Bornheim,0.0,0.142857,0.142857,0.0,0.0,0.0,0.714286,0.0
4,Dornbusch,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0
5,Flughafen,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,Gallus,0.0,0.090909,0.0,0.0,0.0,0.090909,0.727273,0.090909
7,Ginnheim,0.0,0.0,0.0,0.0,0.0,0.2,0.8,0.0
8,Griesheim,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9,Gutleutviertel,0.0,0.2,0.0,0.0,0.2,0.0,0.6,0.0


As can be seen above, had each type of restaurant been kept as its own category, the result would have been a very sparse dataframe with many labels, which would have hampered the clustering later on. Since only the number of restaurants in each neighborhood and the presence of existing Mexican restaurants were of interest, just these two restaurant labels were used. The shape of the final dataframe was therefore greatly reduced as well. 
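To illustrate what the grouped values mean, here is a small worked example on toy data: after one-hot encoding, the per-neighborhood mean of each column is the share of that neighborhood's retained venues falling in the category, so each row sums to 1. The neighborhoods "A" and "B" and their venues are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Restaurant", "Restaurant", "Plaza", "Mexican Restaurant",
                       "Restaurant", "Park"],
})

# One indicator column per category (dtype=int keeps 0/1 values across pandas versions)
onehot = pd.get_dummies(toy["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of the 0/1 indicators is the share of venues per category
shares = onehot.groupby("Neighborhood").mean()
print(shares)
```

In this toy case, half of the venues in "A" are restaurants and a quarter are Mexican restaurants, which mirrors how rows such as Altstadt should be read in `fra_grouped`.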

In [82]:
fra_grouped.shape

(26, 9)

### K Means Clustering

Five clusters were chosen given that only 26 neighborhoods are considered. Increasing the number of clusters much further would push each neighborhood toward being its own cluster, and the model would lose its usefulness. 
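The choice of the number of clusters could also be checked numerically. One common option, which is not what this notebook does, is to compare silhouette scores across several values of k. The sketch below uses synthetic data as a stand-in for the 26 rows of category shares, so the scores it prints say nothing about Frankfurt; the variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 26 rows of 8 non-negative shares that each sum to 1
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(8), size=26)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With the real data, `X` would be the feature matrix passed to `KMeans` in the next cell, and a clearly higher score at some k would support that choice.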

In [86]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

fra_grouped_clustering = fra_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(fra_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 4, 0, 2, 3, 1, 0, 0, 1, 4, 2, 4, 0, 1, 1, 1, 1, 0, 0, 1, 2, 4,
       1, 1, 1, 1])

In [88]:
# A final dataframe is developed to include the neighborhood coordinates and cluster labels
neigh_clusters = fra_df.copy()
neigh_clusters.insert(0, 'Cluster Labels', kmeans.labels_)
neigh_clusters

Unnamed: 0,Cluster Labels,Neighborhood,Latitude,Longitude
0,0,Altstadt,50.110442,8.682901
1,4,Innenstadt,50.112993,8.674341
2,0,Bahnhofsviertel,50.107741,8.668676
3,2,Westend-Süd,50.115245,8.66227
4,3,Westend-Nord,50.126356,8.667921
5,1,Nordend-West,50.124914,8.67795
6,0,Nordend-Ost,50.12492,8.692317
7,0,Ostend,50.115935,8.720546
8,1,Bornheim,50.133056,8.714932
9,4,Gutleutviertel,50.097925,8.648964
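
One point worth checking here is row order. `kmeans.labels_` follows the rows of `fra_grouped`, which `groupby` sorts alphabetically by neighborhood, whereas `fra_df` keeps the order of the Wikipedia table. Inserting the labels by position can therefore pair a label with a different neighborhood than the one it was computed for. The following is a minimal sketch of joining on the neighborhood name instead, using toy data with hypothetical labels.

```python
import pandas as pd

# Toy data: coordinates in source order, grouped features in alphabetical order
coords = pd.DataFrame({
    "Neighborhood": ["Innenstadt", "Altstadt", "Bornheim"],
    "Latitude": [50.112993, 50.110442, 50.133056],
})
grouped = pd.DataFrame({"Neighborhood": ["Altstadt", "Bornheim", "Innenstadt"]})
labels = [0, 2, 4]  # hypothetical labels, one per row of `grouped`

# Attach each label to the name it was computed for, then join on that name
labelled = grouped.assign(**{"Cluster Labels": labels})
aligned = coords.merge(labelled, on="Neighborhood", how="left")
print(aligned)
```

In the notebook, the same pattern would attach `kmeans.labels_` to `fra_grouped['Neighborhood']` and merge the result into `fra_df`.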


In [91]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors



# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
  


In [92]:
rainbow

['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']

### Map of neighborhoods color-coded by cluster

A map of the neighborhood centers is rendered with color coding for their specific cluster labels. 

In [117]:
# create map
map_clusters = folium.Map(location=[50.116667, 8.683333], zoom_start=12) # centered explicitly; lat/long were overwritten by earlier loops

In [121]:
for lat, long, neighborhood, cluster in zip(neigh_clusters['Latitude'], neigh_clusters['Longitude'], neigh_clusters['Neighborhood'], neigh_clusters['Cluster Labels']):
    label = '{}: Cluster {}'.format(neighborhood, cluster)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color=rainbow[cluster-1],
                       fill=True, 
                       fill_color=rainbow[cluster-1], 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(map_clusters)

map_clusters

As can be seen from the different clusters, the neighborhoods around the downtown area all belong to one cluster, which gives an indication of the saturation of venues of interest there. 

In [126]:
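# Center the map explicitly on Frankfurt, since lat/long were overwritten by the marker loops above
fin_map_fra = folium.Map(location=[50.116667, 8.683333], zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(fin_map_fra) # tile layer added to this map, not to heat_map_fra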
HeatMap(restaurant_latlons).add_to(fin_map_fra)
for lat, long, neighborhood in zip(fra_df['Latitude'], fra_df['Longitude'], fra_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color='blue',
                       fill=True, 
                       fill_color='#3186cc', 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(fin_map_fra)

for lat, long, neighborhood, cluster in zip(neigh_clusters['Latitude'], neigh_clusters['Longitude'], neigh_clusters['Neighborhood'], neigh_clusters['Cluster Labels']):
    label = '{}: Cluster {}'.format(neighborhood, cluster)
    label = folium.Popup(label, parse_html=True)
    folium.vector_layers.CircleMarker([lat, long], 
                        radius=5, 
                       popup=label,
                       color=rainbow[cluster-1],
                       fill=True, 
                       fill_color=rainbow[cluster-1], 
                       fill_opacity=0.7, 
                       parse_html=False).add_to(fin_map_fra)

fin_map_fra

Superimposing the heat map shows how the clustering is also affected by the existence of restaurants, and one can see the relative saturation of restaurants in the downtown area. However, zooming in on the two areas of particular interest, Innenstadt and Altstadt, shows that they belong to different clusters although many of the venues found there are very similar. This is due to the proximity of the Bahnhofsviertel and the existence of a Mexican restaurant close to the Altstadt region, as noted by the orange heat source. 

This leaves the option of favoring the Innenstadt, where there are patches and gaps for establishing a Mexican restaurant, although it does not belong to the group where the venues of interest are higher (red label cluster). 

## Results and Discussion <a name="results"></a>

The analysis shows that there is a relatively low number of Mexican restaurants in Frankfurt. This leaves a wide range of possible locations for opening a new Mexican restaurant across many different neighborhoods. However, on closer scrutiny of the aspects that make several neighborhoods similar according to the K-Means clustering, there seems to be an optimum close to the city center: Bahnhofsviertel, Innenstadt and Altstadt. These three options seem plausible. 

When analyzed alongside the density of existing restaurants in general, one can see that the Bahnhofsviertel is already overcrowded with options and also already has existing Mexican restaurants. Looking a bit further, the Altstadt, which belongs to the same cluster, also seems to be an alternative. There is already a Mexican restaurant there, but it sits at what appears to be the boundary of Altstadt and Innenstadt. Innenstadt, although belonging to a different cluster, is also an alternative. 

Other locations belonging to the same cluster are also available, with no direct competition as well as lower restaurant density. One drawback, however, is their relative distance from the city center, which could mean fewer customers, but also the possibility of lower rent. 


## Conclusion <a name="conclusion"></a>

The aim of this project was to provide stakeholders with alternatives regarding the plausibility of opening a Mexican restaurant in Frankfurt am Main, Germany. 

Through the use of K-Means clustering and visual aids (heat maps, map labels), the different neighborhoods of Frankfurt can be characterized based on the venues located there and the existence of restaurants in the vicinity. The scarcity of Mexican restaurants, as well as gaps in areas where other restaurants are available, makes for a promising outlook. 

This study, however, provides only a preliminary analysis, as further data would be needed to reach a final decision. Other factors such as rent costs, permits for venues in the city center, city ordinances, as well as a survey of interest in Mexican cuisine, would need to be gathered and placed within an analysis matrix to see whether opening a Mexican restaurant would be profitable. 