# Capstone Project - The Battle of Neighborhoods

### Applied Data Science Capstone by IBM/Coursera

This Jupyter Notebook contains the capstone project for the *Coursera* [IBM Data Science Specialization](https://www.coursera.org/specializations/ibm-data-science).

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

According to [the statistical office of the City of Zurich](https://www.stadt-zuerich.ch/content/dam/stzh/prd/Deutsch/Statistik/Publikationsdatenbank/jahrbuch/2017/Tabellen/T_JB_2017_10_2.xlsx), around 2,200 restaurants are currently listed in the City of Zurich. With around 400,000 inhabitants, that is roughly one restaurant for every 180 inhabitants, comparable to having five or six restaurants in a small village of 1,000 inhabitants. Unsurprisingly, an estimated 62% of restaurants in Zurich, particularly small ones, operate at a loss ([NZZ](https://www.nzz.ch/wirtschaft/jedes-zweite-restaurant-schreibt-verlust-1.18300150)). To cover personnel and rental costs, stakeholders planning to open a restaurant in Zurich need to make an informed decision about locations that 1) attract a sufficient number of guests and 2) have affordable rents.
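
The density estimate can be reproduced from the two rounded figures quoted above. This is only a quick check of the arithmetic; the exact quotient comes out slightly below 200 inhabitants per restaurant.

```python
# Rounded figures quoted in the text above
restaurants = 2200
inhabitants = 400_000

# How many inhabitants share one restaurant, and how many restaurants
# a village of 1000 inhabitants would have at the same density
inhabitants_per_restaurant = inhabitants / restaurants
restaurants_per_1000 = restaurants / inhabitants * 1000

print(f"Inhabitants per restaurant: {inhabitants_per_restaurant:.0f}")
print(f"Restaurants per 1000 inhabitants: {restaurants_per_1000:.1f}")
```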

The aim of this project is to characterise the neighborhoods of Zurich by how promising they are as locations for a new restaurant. The analysis takes the following aspects into account:

* **The number of already existing restaurants in the vicinity**, which should be as low as possible in order to minimize the number of competitors.
* **The number of "friendly" businesses in the vicinity**, such as shopping facilities or bars, which increase the number of potential customers in the area.
* **The average rental prices in the area**, which should be as low as possible.

I will use machine learning to distinguish areas that meet these criteria from those that do not, and will provide a list of the three most promising areas, together with their scores on each selection criterion, so that stakeholders can make a final decision.
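
The notebook does not fix a scoring method at this point, so the following is only a minimal sketch of how the three criteria could be combined into a single ranking. The neighborhood labels, counts and rents are invented for illustration, and both the min-max scaling and the equal weighting are assumptions rather than the method used in the analysis.

```python
# Hypothetical example data:
# name -> (restaurants nearby, friendly businesses nearby, rent in CHF per m2)
neighborhoods = {
    "A": (30, 40, 540.0),
    "B": (10, 25, 400.0),
    "C": (5, 5, 300.0),
    "D": (20, 35, 350.0),
}

def min_max(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rank_neighborhoods(data):
    """Return (name, score) pairs, best first; a higher score is more promising."""
    names = list(data)
    restaurants = min_max([data[n][0] for n in names])
    friendly = min_max([data[n][1] for n in names])
    rents = min_max([data[n][2] for n in names])
    scores = {}
    for i, name in enumerate(names):
        # Few competitors and low rent score high, as do many friendly businesses
        scores[name] = ((1 - restaurants[i]) + friendly[i] + (1 - rents[i])) / 3
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

top_three = rank_neighborhoods(neighborhoods)[:3]
for name, score in top_three:
    print(name, round(score, 2))
```

Equal weights keep the sketch simple; stakeholders who care more about rent than about competition could replace the plain average with a weighted one.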


## Data <a name="data"></a>

The following data sources will be used for this analysis:

* There are 12 districts in Zurich ("Kreis"). Neighborhoods will be defined as sub-districts, of which there are 34 in Zurich (2-4 per district). The list of neighborhoods is provided by the statistical office of the City of Zurich (see [here](https://www.stadt-zuerich.ch/content/dam/stzh/prd/Deutsch/Statistik/Themen/Bevoelkerung/BEV390T3903_Bevoelkerung-nach-Alter-Stadtkreis-Stadtquartier.xlsx)).
* Location data on restaurants in Zurich will be extracted via the [Foursquare.com](https://www.foursquare.com) API, which provides, among other things, information on the types of venues found in places all over the world, including restaurants, shopping facilities and entertainment venues.
* Information on the average rental costs for commercial spaces per district is provided by the statistical office of the City of Zurich (see [here](https://www.stadt-zuerich.ch/content/dam/stzh/prd/Deutsch/Statistik/Themen/Bauen-Wohnen/Leerflaechen-nach-Nutzungsart-und-Quartier_Mietpreise-leer-stehender-Buero-Praxisflaechen.xlsx)). The figures are in Swiss francs (CHF) per m<sup>2</sup>.
* Geospatial information (i.e. coordinates) for each neighborhood will be retrieved via geocoding, i.e. converting a place name into coordinates, using the [geopy package](https://pypi.org/project/geopy/).





### Load district names and average office rent per district


In [118]:
# Load required packages
import pandas as pd

# Load data from file
df = pd.read_csv("Zurich_df.csv") # load from file
df.head()

Unnamed: 0,Kreis,Quartier,Miete
0,Kreis 1,Rathaus,540.0
1,Kreis 1,Hochschulen,540.0
2,Kreis 1,Lindenhof,540.0
3,Kreis 1,City,540.0
4,Kreis 2,Wollishofen,510.0


### Add geospatial data from geopy

In [119]:
# Load packages
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.extra.rate_limiter import RateLimiter

# Prepare geolocator
geolocator = Nominatim(user_agent="zurich_person") # define user
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Make address into a string for geocode to find
df['address'] = df['Quartier'] + ", " + df['Kreis'] + ", Zürich, Switzerland"

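# Locate all places (geocode returns None when an address is not found)
df['location'] = df['address'].apply(geocode)

# Extract longitude and latitude, leaving missing values where geocoding failed
df['lng'] = df['location'].apply(lambda x: x.longitude if x else None)
df['lat'] = df['location'].apply(lambda x: x.latitude if x else None)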

# Clean up
df.drop(['address', 'location'], axis = 1, inplace = True)

In [120]:
# inspect dataframe
df.head()

Unnamed: 0,Kreis,Quartier,Miete,lng,lat
0,Kreis 1,Rathaus,540.0,8.544311,47.372649
1,Kreis 1,Hochschulen,540.0,8.548613,47.373846
2,Kreis 1,Lindenhof,540.0,8.540799,47.372996
3,Kreis 1,City,540.0,8.535346,47.372943
4,Kreis 2,Wollishofen,510.0,8.530708,47.342427
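
Geocoding by name can occasionally resolve to a similarly named place elsewhere, so a quick plausibility check on the coordinates is useful. The sketch below tests the sample rows shown above against a rough bounding box around the city. The box limits are my own approximation, not official city boundaries, and `outside_box` is a hypothetical helper name.

```python
# Rough bounding box around the City of Zurich (approximate, assumed values)
LAT_MIN, LAT_MAX = 47.30, 47.45
LNG_MIN, LNG_MAX = 8.40, 8.65

# Sample rows from the dataframe above: (Quartier, lng, lat)
rows = [
    ("Rathaus", 8.544311, 47.372649),
    ("Hochschulen", 8.548613, 47.373846),
    ("Lindenhof", 8.540799, 47.372996),
    ("City", 8.535346, 47.372943),
    ("Wollishofen", 8.530708, 47.342427),
]

def outside_box(rows):
    """Return the names of all rows whose coordinates fall outside the box."""
    return [name for name, lng, lat in rows
            if not (LAT_MIN <= lat <= LAT_MAX and LNG_MIN <= lng <= LNG_MAX)]

suspicious = outside_box(rows)
print("Neighborhoods outside the box:", suspicious)
```

Any name that shows up in the list would be worth geocoding again with a more specific address string.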


### Plot all neighborhoods in Zurich

In [121]:
# Get coordinates for the map center (Zurich main station)
address = 'Zurich Hauptbahnhof, Zurich, Switzerland'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("Zurich's coordinates are: ", longitude, latitude)

Zurich's coordinates are:  8.5393635 47.3781008


In [123]:
# Load package for maps
import folium # plotting library

# create map
f = folium.Figure(width = 800, height = 600)
neigh_map = folium.Map(location = [latitude, longitude], zoom_start = 12).add_to(f)

for lat, lng, label in zip(df.lat, df.lng, df.Quartier):
    folium.CircleMarker(
        [lat, lng],
        radius = 7,
        color = 'royalblue',
        popup = label,
        fill = True,
        fill_color = 'fuchsia',
        fill_opacity = 0.3
    ).add_to(neigh_map)

neigh_map

### Add information on venues

In [204]:
# Prepare information for API access
import os
import requests # library to handle requests
from pandas import json_normalize

# Read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names are an assumption; use whatever your setup provides)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20200423' # Foursquare API version
radius = 500 # meters from center
intent = 'browse'

# Provide information on where to look for venues
districts = df['Kreis']
neighbs = df['Quartier']
rents = df['Miete']
longitudes = df['lng']
latitudes = df['lat']

In [193]:
# function that extracts the category of the venue (because sometimes it is empty)
def get_category_type(item):
    categories_list = item['categories']
        
    if len(categories_list) == 0:
        return "unknown"
    else:
        return categories_list[0]['name']
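
To see what the helper does, it can be applied to two hand-made venue records. The dictionaries below are invented and only mimic the `categories` field that the function reads; the function is repeated so that the snippet runs on its own.

```python
def get_category_type(item):
    """Return the name of the first category, or "unknown" if there is none."""
    categories_list = item['categories']
    if len(categories_list) == 0:
        return "unknown"
    return categories_list[0]['name']

# Invented records shaped like the venue entries used in this notebook
venue_with_category = {'name': 'Example Bakery', 'categories': [{'name': 'Bakery'}]}
venue_without_category = {'name': 'Example Venue', 'categories': []}

print(get_category_type(venue_with_category))
print(get_category_type(venue_without_category))
```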

In [207]:
# Loop through all neighborhoods and collect the venues around each one
venues_list=[]

for dist, neigh, rent, lat, lng in zip(districts, neighbs, rents, latitudes, longitudes):
    
    # build the API request; requests takes care of URL-encoding the parameters
    url = 'https://api.foursquare.com/v2/venues/search'
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'intent': intent}
    
    # make the GET request and stop early on an HTTP error
    response = requests.get(url, params=params)
    response.raise_for_status()
    results = response.json()['response']['venues']
    
    # retrieve relevant information from results into a list
    venues_list.append([(
        dist,
        neigh,
        rent, 
        lat, 
        lng, 
        v['name'], 
        v['location']['lat'], 
        v['location']['lng'],  
        get_category_type(v)) for v in results])
    
# convert list to dataframe and change column names
nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['District',
                         'Neighborhood',
                         'Neighborhood Rent',
                         'Neighborhood Latitude', 
                         'Neighborhood Longitude', 
                         'Venue', 
                         'Venue Latitude', 
                         'Venue Longitude', 
                         'Venue Category']
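
Since the search is limited to a 500 m radius, it can be worth checking how far a returned venue lies from its neighborhood center. The following sketch uses the haversine formula with a mean Earth radius of 6,371 km, which is an approximation. The coordinates are those of the Rathaus neighborhood and of one venue returned for it; `haversine_m` is a hypothetical helper name.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

neighborhood = (47.372649, 8.544311)  # Rathaus (lat, lng)
venue = (47.37228, 8.54427)           # a venue found near Rathaus (lat, lng)

distance = haversine_m(*neighborhood, *venue)
print(f"Distance: {distance:.0f} m, within 500 m: {distance <= 500}")
```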

### Remove venues with unknown category

In [214]:
# keep only venues with a known category; copy so later edits do not touch nearby_venues
final_df = nearby_venues[nearby_venues['Venue Category'] != 'unknown'].copy()
final_df.head()

Unnamed: 0,District,Neighborhood,Neighborhood Rent,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Kreis 1,Rathaus,540.0,47.372649,8.544311,Oliver Twist Pub Zürich,47.37228,8.54427,Sports Bar
1,Kreis 1,Rathaus,540.0,47.372649,8.544311,Äss-Bar,47.372561,8.543693,Bakery
2,Kreis 1,Rathaus,540.0,47.372649,8.544311,Restaurant 1001,47.372974,8.543783,Falafel Restaurant
3,Kreis 1,Rathaus,540.0,47.372649,8.544311,Raclette Factory,47.372376,8.543813,Swiss Restaurant
4,Kreis 1,Rathaus,540.0,47.372649,8.544311,Zürich,47.373158,8.544117,Motel


This is the final dataframe. Let's go!
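
As a first step towards the criteria from the introduction, venues can be split into restaurants and other businesses. The sketch below does this for the five sample rows above using only the standard library. Treating every category whose name contains "Restaurant" as a restaurant is a simplifying assumption, and which of the remaining categories count as "friendly" is left open here.

```python
from collections import Counter

# Sample rows from the dataframe above: (Neighborhood, Venue, Venue Category)
sample = [
    ("Rathaus", "Oliver Twist Pub Zürich", "Sports Bar"),
    ("Rathaus", "Äss-Bar", "Bakery"),
    ("Rathaus", "Restaurant 1001", "Falafel Restaurant"),
    ("Rathaus", "Raclette Factory", "Swiss Restaurant"),
    ("Rathaus", "Zürich", "Motel"),
]

def is_restaurant(category):
    """Assumption: a venue is a restaurant if its category name says so."""
    return "restaurant" in category.lower()

restaurant_counts = Counter()
other_counts = Counter()
for neighborhood, venue, category in sample:
    if is_restaurant(category):
        restaurant_counts[neighborhood] += 1
    else:
        other_counts[neighborhood] += 1

print(dict(restaurant_counts))
print(dict(other_counts))
```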

## Methodology <a name="methodology"></a>

## Results <a name="results"></a>

## Discussion <a name="discussion"></a>

## Conclusion <a name="conclusion"></a>