# Coursera Capstone Project: The Battle of Neighborhoods - Week 1 & 2

## Week 1: Description of the problem as well as the data used for the project

In the past few weeks, we introduced the notions of geolocation, spatial data analysis and clustering. This enabled us to retrieve geographic data for a given location and segment it into subcategories according to pre-defined characteristics. Unsupervised learning through k-means clustering gave us the information needed to analyze and compare the cities of New York and Toronto based on existing venues taken from Foursquare.

Throughout this period, we gathered information about which kinds of venues exist in each neighborhood and how prevalent they are. This made it possible to group the dataset and, potentially, to draw inferences about the socio-economic and cultural character of each subregion.

As a next step, we can use this knowledge to inform economic decisions. In particular, we can use the data to assess the potential outcome of opening certain venues, such as restaurants, cafés or bars, within a given neighborhood. The data indicate which subregions may already be economically saturated and in which ones demand may still exist.

To follow this idea, we assume that we are a medium-priced Japanese franchise chain operating in the food and beverage industry (comparable to Vapiano, the now insolvent German chain that offered Italian food). Having already analyzed both New York and Toronto, we plan to open the first hub in Vienna, the capital of Austria.

Such a decision involves a wide range of economic and social variables, which are numerous and sometimes impossible to quantify. Although the list is certainly not complete, we define the following criteria on which to base a verdict:

1. The neighborhood or district must not be saturated in the food and beverage market. This means we need to find a region that either does not yet offer what the company is trying to introduce or in which, although supply exists, demand for the product is not yet met. As we are unable to measure the latter (for now), we focus on the former.

2. The neighborhood or district must be frequently visited. This means the region should be located in an amusement area that is frequented, preferably, by both local consumers and tourists. This can be measured by analyzing the density of restaurants, bars and other venues as well as tourist attractions.

3. The price level of our offering must suit the average income of the respective region. In particular, we cannot introduce an offering priced far above what local residents are able to pay. Although this is especially hard to measure, a potential solution may lie in the analysis of average rental prices, if available. The availability of services such as Uber or Airbnb may also give a better understanding of the socio-economic status of the individual regions.
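
The venue density mentioned in criterion 2 could be approximated by counting venues per neighborhood. The following is only a minimal sketch on made-up data: it assumes a venue table with `Neighborhood` and `Venue_Category` columns like the one built later in this notebook, and the helper name `venue_density` as well as the toy rows are hypothetical.

```python
import pandas as pd


def venue_density(venues: pd.DataFrame) -> pd.DataFrame:
    """Count venues and distinct venue categories per neighborhood."""
    return (
        venues.groupby("Neighborhood")
        .agg(
            n_venues=("Venue_Category", "size"),
            n_categories=("Venue_Category", "nunique"),
        )
        .sort_values("n_venues", ascending=False)
    )


# Made-up rows for illustration only (not real Foursquare results)
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue_Category": ["Plaza", "Church", "Plaza", "Grocery Store"],
})

density = venue_density(toy_venues)
print(density)
```

A neighborhood with many venues spread over many categories would be a candidate for a frequently visited area, although venue counts within a fixed search radius are only a rough proxy for footfall.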

To make an adequate assessment, I will use data from Austrian public sources as well as Foursquare to retrieve both the geographic composition and the venues. Further, I will create a data set that shows, for each region, the average housing prices and whether Airbnb is available (potentially also to what extent). We will use the Foursquare data to assess which regions are potentially already saturated with Japanese (or, more broadly, Asian) food by looking at both the existence and the prevalence of certain food types, and to identify tourist hotspots by looking at the existence of attractions and the general availability of venues. Average housing prices from local sources and Airbnb availability then serve as indicators of average income.

Once all factors are included, I will perform a k-means clustering analysis that shows the clusters in which an opening appears attractive and, if time permits, perform a more detailed analysis of the respective regions. Importantly, we are looking for a region that:

1. Is economically viable
2. Has a strong amusement area and is preferably located in a tourist area
3. Is frequently visited
4. Does not already have a strong Japanese food scene
5. Has Airbnb available and fairly even rental prices

In the end, I will deliver a graphical representation of the region and a clustering output on which I will base my assessment. 


## Week 2: Analysis and implementation of the code

### Part 1: Baseline commands

#### First, we again load the required packages into our lab:

In [204]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import folium
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs # sklearn.datasets.samples_generator was removed in newer scikit-learn releases
from sklearn.preprocessing import StandardScaler
import json
import urllib.request
import requests
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML

# transforming a JSON file into a pandas DataFrame
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0

#### Then, we can prepare the access to the Foursquare API for Vienna:

We can get the Vienna-based coordinates from the GeoNames postal code export:

http://download.geonames.org/export/zip/

We will again use the QGIS application to transform these coordinates into a geometric representation of Vienna's geography.

Unfortunately, these coordinates only cover the 23 boroughs (Bezirke) of Vienna and not the individual neighborhoods within each borough. However, several aspects speak in favour of a borough-level analysis:

1. Rental price data is only available at borough level
2. Economic factors are only available at borough level
3. The most promising boroughs (the more central and "trendier" ones) cover at most 5-9 square km and are very well served by public transport, so characteristics probably do not differ substantially within them

In [150]:
vienna = pd.read_csv(r"/Users/nikolas.anic/Desktop/ML/Vienna.csv")

#### Now we will access the Foursquare API to get information on the venues in Vienna by calling our function:

In [151]:
def Vienna_venues(Borough, Latitude, Longitude, Postal_Code):
    """Query the Foursquare explore endpoint for each borough and return all venues as one DataFrame."""

    CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder, keep real credentials private)
    CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
    VERSION = '20180605'
    LIMIT = 90
    radius = 500

    venues_list = []

    for Bor, latitude, longitude, post in zip(Borough, Latitude, Longitude, Postal_Code):

        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            latitude,
            longitude,
            radius,
            LIMIT)

        venue = requests.get(url).json()["response"]['groups'][0]['items']

        # one row per venue, tagged with the borough it was queried for
        venues_list.extend([(Bor,
                             latitude,
                             longitude,
                             post,
                             v["venue"]["name"],
                             v["venue"]["categories"][0]["name"],
                             v["venue"]["location"]["lat"],
                             v["venue"]["location"]["lng"]) for v in venue])

    # build the DataFrame once, after all boroughs have been queried
    pd_v = pd.DataFrame(venues_list, columns=['Neighborhood',
                                              'Neighborhood-Latitude',
                                              'Neighborhood-Longitude',
                                              'Postal Code',
                                              'Venue',
                                              'Venue_Category',
                                              'Venue_Latitude',
                                              'Venue_Longitude'])
    return pd_v
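
The flattening step inside the function can be tried out without calling the API. The sketch below is an illustration under assumptions, not the Foursquare client itself: `flatten_explore_items` is a hypothetical helper, and `sample_items` is a hand-written payload that only mimics the keys accessed above (`venue`, `name`, `categories`, `location`). It also shows one way to handle a venue that comes back without any category, which would otherwise raise an `IndexError` on `categories[0]`.

```python
def flatten_explore_items(borough, items):
    """Turn the 'items' list of an explore-style response into flat tuples.

    Venues without a category get None instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((borough,
                     venue["name"],
                     category,
                     venue["location"]["lat"],
                     venue["location"]["lng"]))
    return rows


# Hand-written payload mimicking the structure used in Vienna_venues
sample_items = [
    {"venue": {"name": "Stephansdom",
               "categories": [{"name": "Church"}],
               "location": {"lat": 48.208626, "lng": 16.372672}}},
    {"venue": {"name": "Venue without category",  # hypothetical entry
               "categories": [],
               "location": {"lat": 48.2, "lng": 16.37}}},
]

rows = flatten_explore_items("Innere Stadt", sample_items)
print(rows)
```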



In [152]:
Vienna = Vienna_venues(Postal_Code = vienna["Postal Code"],
Borough = vienna["Borough"],
Latitude = vienna["Latitude"],
Longitude = vienna["Longitude"])

In [153]:
Vienna

Unnamed: 0,Neighborhood,Neighborhood-Latitude,Neighborhood-Longitude,Postal Code,Venue,Venue_Category,Venue_Latitude,Venue_Longitude
0,Innere Stadt,48.2077,16.3705,1010,Stephansplatz,Plaza,48.208299,16.371880
1,Innere Stadt,48.2077,16.3705,1010,Stephansdom,Church,48.208626,16.372672
2,Innere Stadt,48.2077,16.3705,1010,Graben,Pedestrian Plaza,48.208915,16.369379
3,Innere Stadt,48.2077,16.3705,1010,COS,Clothing Store,48.209359,16.371591
4,Innere Stadt,48.2077,16.3705,1010,DO & CO Restaurant,Restaurant,48.208240,16.371758
...,...,...,...,...,...,...,...,...
613,Liesing,48.1433,16.2934,1230,Atzgersdorfer Platz,Plaza,48.146615,16.296017
614,Liesing,48.1433,16.2934,1230,Lichtenstöger,Austrian Restaurant,48.142088,16.295672
615,Liesing,48.1433,16.2934,1230,Quan Lounge,Asian Restaurant,48.141400,16.291759
616,Liesing,48.1433,16.2934,1230,Penny Markt,Grocery Store,48.145558,16.297634


#### Next, we can add important demographic and socio-economic characteristics per borough, taken from the official statistics of the City of Vienna, and merge both dataframes. This will supply us with a dataframe consisting of:

1. Rental prices per sqm
2. Growth of housing in the last decade
3. Two indicators of Airbnb availability: whether Airbnb is available at all (> 5 offers) and whether it is commonly used (> 50 offers)
4. Median gross income
5. Google searches via a real estate platform, grouped into five bins
6. An indicator of whether the area is considered a tourist area, as defined by the tourism department of the City of Vienna
7. A survey response on the frequent availability of public transport

#### Together with the data gathered from Foursquare, we can characterize individual boroughs by their cuisine and venue availability as well as their demographic characteristics.

In [154]:
vienna_demographics = pd.read_csv(r"/Users/nikolas.anic/Desktop/ML/Vienna_Demographics.csv")

In [155]:
vienna_total = pd.merge(vienna, vienna_demographics, on = "Borough")
vienna_total

Unnamed: 0,Country,Postal Code,City,Borough,Latitude,Longitude,Rental Prices per sqm,Growth last decade %,AirBnb availability (>5),AirBnB availability (>50),Income Gross,Google searches for rental flat ranked,Tourist Area inidicator,Good Public Transport indicator %
0,Austria,1010,Wien,Innere Stadt,48.2077,16.3705,19.96,2.7,1,1,40116,5,1,91
1,Austria,1020,Wien,Leopoldstadt,48.2167,16.4,16.51,12.3,1,1,33189,4,1,63
2,Austria,1030,Wien,Landstrasse,48.1981,16.3948,16.59,8.6,1,1,35649,3,1,61
3,Austria,1040,Wien,Wieden,48.192,16.3671,16.36,9.2,1,1,38837,5,1,69
4,Austria,1050,Wien,Margareten,48.1865,16.3549,14.97,8.7,1,1,29306,3,0,58
5,Austria,1060,Wien,Mariahilf,48.1952,16.3503,16.23,8.3,1,1,35405,3,1,82
6,Austria,1070,Wien,Neubau,48.2,16.35,16.45,7.1,1,1,37601,4,1,95
7,Austria,1080,Wien,Josefstadt,48.2167,16.35,15.21,7.7,1,1,37745,4,1,97
8,Austria,1090,Wien,Alsergrund,48.2333,16.35,16.35,8.2,1,0,36738,4,0,92
9,Austria,1100,Wien,Favoriten,48.1521,16.3878,16.25,15.8,1,0,27246,1,1,63
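
Because `pd.merge` performs an inner join by default, boroughs whose names do not match exactly in both files are dropped silently. The following minimal sketch, using made-up borough names and income values, shows one way to check for that with the `indicator` and `validate` parameters of `pd.merge`:

```python
import pandas as pd

# Made-up data for illustration only
boroughs = pd.DataFrame({
    "Borough": ["A", "B", "C"],
    "Postal Code": [1, 2, 3],
})
demographics = pd.DataFrame({
    "Borough": ["A", "B"],
    "Income Gross": [40000, 38000],
})

# A left join keeps every borough; validate guards against duplicate keys
checked = pd.merge(boroughs, demographics, on="Borough",
                   how="left", validate="one_to_one", indicator=True)

# Boroughs without a demographic record are marked "left_only"
unmatched = checked.loc[checked["_merge"] == "left_only", "Borough"].tolist()
print(unmatched)
```

If the list of unmatched boroughs is empty, the inner join used above loses no rows.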


In [149]:
Vienna['GrpIdx'] = Vienna['Neighborhood'].rank(method='dense').astype(int)
Vienna.sort_values("Neighborhood", inplace = True)

Unnamed: 0,Neighborhood,Neighborhood-Latitude,Neighborhood-Longitude,Postal Code,Venue,Venue_Category,Venue_Latitude,Venue_Longitude,GrpIdx
481,Alsergrund,48.2333,16.3500,1090,Steirerbeisel,Austrian Restaurant,48.228970,16.349176,1
452,Alsergrund,48.2333,16.3500,1090,Währinger Park,Park,48.232812,16.348353,1
453,Alsergrund,48.2333,16.3500,1090,Teka Sushi,Sushi Restaurant,48.233980,16.351905,1
454,Alsergrund,48.2333,16.3500,1090,SLUbar,Cocktail Bar,48.236248,16.349429,1
455,Alsergrund,48.2333,16.3500,1090,Mozart & Meisl,Gastropub,48.235467,16.348887,1
...,...,...,...,...,...,...,...,...,...
195,Wieden,48.1920,16.3671,1040,Allergiker Café,Gluten-free Restaurant,48.194632,16.367092,23
194,Wieden,48.1920,16.3671,1040,Tancredi,Diner,48.191492,16.365593,23
183,Wieden,48.1920,16.3671,1040,Suite Hotel 900m,Hotel,48.193517,16.367123,23
191,Wieden,48.1920,16.3671,1040,Corto e Nero,Café,48.193016,16.367073,23
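
The `GrpIdx` column relies on dense ranking: every distinct neighborhood name receives one consecutive integer in alphabetical order, and repeated names share the same value. A minimal sketch on a toy series (the values are chosen for illustration only):

```python
import pandas as pd

names = pd.Series(["Wieden", "Alsergrund", "Wieden", "Liesing"])

# Dense rank: ties share a rank and no rank numbers are skipped
group_index = names.rank(method="dense").astype(int)
print(group_index.tolist())
```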


#### We now quickly visualize the locations of all venues obtained from Foursquare:

In [207]:
address = 'Vienna, AT'

geolocator = Nominatim(user_agent="foursquare_agent") # call the geolocator 

location = geolocator.geocode(address)
latitude_vienna = location.latitude
longitude_vienna = location.longitude

vienna_map = folium.Map(location = [latitude_vienna, longitude_vienna], zoom_start = 12)

folium.CircleMarker(
    [latitude_vienna, longitude_vienna],
    radius=10,
    color='red',
    popup='District Center',
    fill=True,
    fill_color='red',
    fill_opacity=0.6,
).add_to(vienna_map)

# add one blue marker per venue, labelled with its category
for lat, lng, label in zip(Vienna.Venue_Latitude, Vienna.Venue_Longitude, Vienna.Venue_Category):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.6,
        popup=folium.Popup(label, parse_html=True)
    ).add_to(vienna_map)


vienna_map

#### Now, we can merge the venue list with the combined demographics table and obtain the following dataframe, which we call Vienna:

In [157]:
Vienna = pd.merge(Vienna, vienna_total, on = "Postal Code")
Vienna

Unnamed: 0,Neighborhood,Neighborhood-Latitude,Neighborhood-Longitude,Postal Code,Venue,Venue_Category,Venue_Latitude,Venue_Longitude,Country,City,...,Latitude,Longitude,Rental Prices per sqm,Growth last decade %,AirBnb availability (>5),AirBnB availability (>50),Income Gross,Google searches for rental flat ranked,Tourist Area inidicator,Good Public Transport indicator %
0,Innere Stadt,48.2077,16.3705,1010,Stephansplatz,Plaza,48.208299,16.371880,Austria,Wien,...,48.2077,16.3705,19.96,2.7,1,1,40116,5,1,91
1,Innere Stadt,48.2077,16.3705,1010,Stephansdom,Church,48.208626,16.372672,Austria,Wien,...,48.2077,16.3705,19.96,2.7,1,1,40116,5,1,91
2,Innere Stadt,48.2077,16.3705,1010,Graben,Pedestrian Plaza,48.208915,16.369379,Austria,Wien,...,48.2077,16.3705,19.96,2.7,1,1,40116,5,1,91
3,Innere Stadt,48.2077,16.3705,1010,COS,Clothing Store,48.209359,16.371591,Austria,Wien,...,48.2077,16.3705,19.96,2.7,1,1,40116,5,1,91
4,Innere Stadt,48.2077,16.3705,1010,DO & CO Restaurant,Restaurant,48.208240,16.371758,Austria,Wien,...,48.2077,16.3705,19.96,2.7,1,1,40116,5,1,91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
613,Liesing,48.1433,16.2934,1230,Atzgersdorfer Platz,Plaza,48.146615,16.296017,Austria,Wien,...,48.1433,16.2934,16.13,10.8,0,0,36753,5,0,15
614,Liesing,48.1433,16.2934,1230,Lichtenstöger,Austrian Restaurant,48.142088,16.295672,Austria,Wien,...,48.1433,16.2934,16.13,10.8,0,0,36753,5,0,15
615,Liesing,48.1433,16.2934,1230,Quan Lounge,Asian Restaurant,48.141400,16.291759,Austria,Wien,...,48.1433,16.2934,16.13,10.8,0,0,36753,5,0,15
616,Liesing,48.1433,16.2934,1230,Penny Markt,Grocery Store,48.145558,16.297634,Austria,Wien,...,48.1433,16.2934,16.13,10.8,0,0,36753,5,0,15


#### We now have to clean the dataset and delete some duplicate or unnecessary columns:

In [160]:
Vienna.drop(["Borough", "Longitude", "Country", "City", "Latitude"], axis = 1, inplace = True)

In [161]:
Vienna

Unnamed: 0,Neighborhood,Neighborhood-Latitude,Neighborhood-Longitude,Postal Code,Venue,Venue_Category,Venue_Latitude,Venue_Longitude,Rental Prices per sqm,Growth last decade %,AirBnb availability (>5),AirBnB availability (>50),Income Gross,Google searches for rental flat ranked,Tourist Area inidicator,Good Public Transport indicator %
0,Innere Stadt,48.2077,16.3705,1010,Stephansplatz,Plaza,48.208299,16.371880,19.96,2.7,1,1,40116,5,1,91
1,Innere Stadt,48.2077,16.3705,1010,Stephansdom,Church,48.208626,16.372672,19.96,2.7,1,1,40116,5,1,91
2,Innere Stadt,48.2077,16.3705,1010,Graben,Pedestrian Plaza,48.208915,16.369379,19.96,2.7,1,1,40116,5,1,91
3,Innere Stadt,48.2077,16.3705,1010,COS,Clothing Store,48.209359,16.371591,19.96,2.7,1,1,40116,5,1,91
4,Innere Stadt,48.2077,16.3705,1010,DO & CO Restaurant,Restaurant,48.208240,16.371758,19.96,2.7,1,1,40116,5,1,91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
613,Liesing,48.1433,16.2934,1230,Atzgersdorfer Platz,Plaza,48.146615,16.296017,16.13,10.8,0,0,36753,5,0,15
614,Liesing,48.1433,16.2934,1230,Lichtenstöger,Austrian Restaurant,48.142088,16.295672,16.13,10.8,0,0,36753,5,0,15
615,Liesing,48.1433,16.2934,1230,Quan Lounge,Asian Restaurant,48.141400,16.291759,16.13,10.8,0,0,36753,5,0,15
616,Liesing,48.1433,16.2934,1230,Penny Markt,Grocery Store,48.145558,16.297634,16.13,10.8,0,0,36753,5,0,15


#### We can now define indicator variables which tell us what type of venue the individual borough, or neighborhood, has: 

In [162]:
vienna_coordinates = "/Users/nikolas.anic/Desktop/ML/GeoJSON/Vienna.geojson"

#### We are also interested in which neighborhoods already offer Asian cuisine. If many Asian restaurants already operate within a given neighborhood, chances are higher that demand is already saturated.

To do so, we first create a venue-level dummy variable that equals 1 if the venue is any type of Asian restaurant. We then extract all neighborhoods that contain at least one such venue and assign a second, neighborhood-level dummy that equals 1 for every row of a neighborhood in which Asian cuisine is currently offered and 0 otherwise.

In [202]:
Vienna["Asian_restaurants_available"] = (Vienna["Venue_Category"].isin(["Chinese Restaurant", "Asian Restaurant", "Japanese Restaurant", "Sushi Restaurant"])).astype(int)
Vienna_asia_neighborhoods = Vienna.loc[Vienna["Asian_restaurants_available"] == 1, "Neighborhood"]
Vienna_asia_neighborhoods

89               Innere Stadt
138              Leopoldstadt
184               Landstrasse
190                    Wieden
195                    Wieden
205                    Wieden
230                    Wieden
231                    Wieden
232                    Wieden
246                    Wieden
298                 Mariahilf
318                 Mariahilf
322                 Mariahilf
346                    Neubau
459                Josefstadt
462                Josefstadt
467                Alsergrund
522     Rudolfsheim-Fuenfhaus
528     Rudolfsheim-Fuenfhaus
532     Rudolfsheim-Fuenfhaus
615                   Liesing
Name: Neighborhood, dtype: object

In [200]:
Vienna["Asian_cuisine_available"] = (Vienna["Neighborhood"].isin(Vienna_asia_neighborhoods)).astype(int)

In [201]:
Vienna

Unnamed: 0,Neighborhood,Neighborhood-Latitude,Neighborhood-Longitude,Postal Code,Venue,Venue_Category,Venue_Latitude,Venue_Longitude,Rental Prices per sqm,Growth last decade %,AirBnb availability (>5),AirBnB availability (>50),Income Gross,Google searches for rental flat ranked,Tourist Area inidicator,Good Public Transport indicator %,Asian_restuaratns_available,Asian_restuarants_available,Asian_restaurants_available,Asian_cuisine_available
0,Innere Stadt,48.2077,16.3705,1010,Stephansplatz,Plaza,48.208299,16.371880,19.96,2.7,1,1,40116,5,1,91,0,0,0,1
1,Innere Stadt,48.2077,16.3705,1010,Stephansdom,Church,48.208626,16.372672,19.96,2.7,1,1,40116,5,1,91,0,0,0,1
2,Innere Stadt,48.2077,16.3705,1010,Graben,Pedestrian Plaza,48.208915,16.369379,19.96,2.7,1,1,40116,5,1,91,0,0,0,1
3,Innere Stadt,48.2077,16.3705,1010,COS,Clothing Store,48.209359,16.371591,19.96,2.7,1,1,40116,5,1,91,0,0,0,1
4,Innere Stadt,48.2077,16.3705,1010,DO & CO Restaurant,Restaurant,48.208240,16.371758,19.96,2.7,1,1,40116,5,1,91,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
613,Liesing,48.1433,16.2934,1230,Atzgersdorfer Platz,Plaza,48.146615,16.296017,16.13,10.8,0,0,36753,5,0,15,0,0,0,1
614,Liesing,48.1433,16.2934,1230,Lichtenstöger,Austrian Restaurant,48.142088,16.295672,16.13,10.8,0,0,36753,5,0,15,0,0,0,1
615,Liesing,48.1433,16.2934,1230,Quan Lounge,Asian Restaurant,48.141400,16.291759,16.13,10.8,0,0,36753,5,0,15,1,1,1,1
616,Liesing,48.1433,16.2934,1230,Penny Markt,Grocery Store,48.145558,16.297634,16.13,10.8,0,0,36753,5,0,15,0,0,0,1
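
A neighborhood-level flag only tells us whether Asian cuisine exists, not how prevalent it is. As a hedged alternative to the two-step approach above, the sketch below derives both the flag and a count per neighborhood with `groupby(...).transform`. The neighborhood names, rows and the column names `is_asian`, `asian_count` and `asian_available` are made up for illustration.

```python
import pandas as pd

asian_categories = ["Chinese Restaurant", "Asian Restaurant",
                    "Japanese Restaurant", "Sushi Restaurant"]

# Made-up venue rows for illustration only
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue_Category": ["Sushi Restaurant", "Café", "Asian Restaurant", "Park"],
})

# Venue-level dummy: 1 if the venue itself is an Asian restaurant
toy["is_asian"] = toy["Venue_Category"].isin(asian_categories).astype(int)

# Neighborhood-level values, broadcast back to every venue row
toy["asian_count"] = toy.groupby("Neighborhood")["is_asian"].transform("sum")
toy["asian_available"] = toy.groupby("Neighborhood")["is_asian"].transform("max")
print(toy)
```

The count distinguishes a neighborhood with a single Asian restaurant from one with many, which may be a better proxy for saturation than the binary flag alone.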


#### Now that we have identified which neighborhoods already offer Asian cuisine, we can start building clusters
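
As a rough sketch of what this clustering step might look like, the example below one-hot encodes venue categories, averages them per neighborhood, adds one demographic feature, standardizes everything and fits k-means. All rows are made up, and the choice of two clusters and of `Rental Prices per sqm` as the only demographic feature is an assumption for illustration, not the final model of this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up venue rows for illustration only
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Venue_Category": ["Café", "Bar", "Café", "Bar",
                       "Park", "Grocery Store", "Park", "Grocery Store"],
    "Rental Prices per sqm": [19.0, 19.0, 18.5, 18.5, 12.0, 12.0, 12.5, 12.5],
})

# Share of each venue category per neighborhood
category_share = (
    pd.get_dummies(toy["Venue_Category"], dtype=float)
    .groupby(toy["Neighborhood"])
    .mean()
)

# One demographic value per neighborhood (constant within a neighborhood)
rent = toy.groupby("Neighborhood")["Rental Prices per sqm"].first()
features = category_share.join(rent)

# Standardize so that no single feature dominates the distance computation
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
features["Cluster"] = kmeans.labels_
print(features["Cluster"])
```

In this toy data, neighborhoods with similar venue mixes and rents end up in the same cluster. The cluster numbers themselves are arbitrary labels and carry no ranking.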