# Capstone Project - The Battle of the Neighborhoods

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)

## Introduction: Business Problem <a name="introduction"></a>

**Scenario**: You are a great data scientist who loves beating your personal bests, and you currently live in Istanbul, Turkey. You receive a job offer from a major company in Amsterdam, Netherlands. If you accept the offer, which is unquestionably a great opportunity for your career, you will have to move to Amsterdam.
<img src='ams-ist.jpg' alt='scenes of cities Amsterdam and Istanbul'></img>
To perform at your best, you want to keep the adaptation period as short as possible. That is why you are planning to move to a neighbourhood that is similar to your home town in terms of venues.

In this capstone project we will try to find an optimal location for you.
We will look for locations with Turkish restaurants in the vicinity and clearly state the best candidates. You can then choose the final location yourself, based on its distance to your workplace.

We will use our data science powers to generate a short list of the most promising neighbourhoods based on these criteria.

## Data <a name="data"></a>

Based on the definition of our problem, by the end of the analysis we should have answered the following questions:
* How similar are Istanbul and Amsterdam in terms of venues?
* Which neighbourhoods are the best candidates to move to?

We need two datasets, one per city. Each dataset should include at least the following features:
* Districts
* Neighbourhoods
* Geographical coordinates (latitude, longitude)

The website of the Amsterdam municipality provides a neighbourhoods dataset, which we will use for our analysis. It is available at this link: <a href='https://maps.amsterdam.nl/open_geodata/?k=198'>City of Amsterdam neighbourhoods dataset</a>.
Unfortunately, there is no ready-to-use dataset for the city of Istanbul, so we will scrape the required features from websites.

The following data sources will be needed to extract or generate the required information:
* Geographical coordinates of neighbourhoods in Istanbul will be obtained using **Google Maps API geocoding** 
* Number of venues and their type and location in every neighborhood will be obtained using **Foursquare API**

### Initializing required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from geopy.geocoders import Nominatim
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm # notebook progress bar (tqdm_notebook is deprecated)
import matplotlib.cm as cm
import matplotlib.colors as colors

%matplotlib inline

### Amsterdam Dataset Preparation

In [2]:
amsterdam_data = pd.read_csv('datasets/GEBIED_BUURTEN.csv', sep=';')
amsterdam_data.head()

|   | OBJECTNUMMER | Buurt_code | Buurt | Buurtcombinatie_code | Stadsdeel_code | Opp_m2 | WKT_LNG_LAT | WKT_LAT_LNG | LNG | LAT | Unnamed: 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | F81d | Calandlaan/Lelylaan | F81 | F | 275360.0 | POLYGON((4.800801 52.355175,4.809055 52.356842... | POLYGON((52.355175 4.800801,52.356842 4.809055... | 4.809697 | 52.355708 |  |
| 1 | 2 | F81e | Osdorp Zuidoost | F81 | F | 519366.0 | POLYGON((4.818583 52.357519,4.818622 52.356295... | POLYGON((52.357519 4.818583,52.356295 4.818622... | 4.811344 | 52.353736 |  |
| 2 | 3 | F82a | Osdorp Midden Noord | F82 | F | 215541.0 | POLYGON((4.786657 52.362712,4.795326 52.364434... | POLYGON((52.362712 4.786657,52.364434 4.795326... | 4.791792 | 52.362078 |  |
| 3 | 4 | F82b | Osdorp Midden Zuid | F82 | F | 258379.0 | POLYGON((4.788293 52.359736,4.796917 52.36148,... | POLYGON((52.359736 4.788293,52.36148 4.796917,... | 4.793781 | 52.358838 |  |
| 4 | 5 | F82c | Zuidwestkwadrant Osdorp Noord | F82 | F | 240774.0 | POLYGON((4.790209 52.356207,4.799258 52.358027... | POLYGON((52.356207 4.790209,52.358027 4.799258... | 4.795597 | 52.355523 |  |


The City of Amsterdam has 8 districts and 481 neighbourhoods in total.

In [3]:
len(amsterdam_data['Stadsdeel_code'].unique())

8

In [4]:
amsterdam_data['Buurt'].count()

481
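
The same two checks can also be written with `nunique()`. The sketch below runs on a made-up three-row frame (the values are invented for illustration, not taken from the dataset). The point is the difference between the two calls: `count()` counts non-null rows, while `nunique()` counts distinct values, so it would also reveal repeated neighbourhood names.

```python
import pandas as pd

# Toy stand-in for amsterdam_data; the rows are invented for illustration.
toy = pd.DataFrame({
    'Buurt': ['Buurt A', 'Buurt B', 'Buurt B'],
    'Stadsdeel_code': ['F', 'F', 'K'],
})

n_districts = toy['Stadsdeel_code'].nunique()  # distinct district codes
n_rows = toy['Buurt'].count()                  # non-null rows
n_names = toy['Buurt'].nunique()               # distinct neighbourhood names

print(n_districts, n_rows, n_names)
```

If `count()` and `nunique()` disagree on the real data, some neighbourhood names occur more than once and may need the district as a tie-breaker.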

We will keep only the columns we need for our analysis.

In [5]:
needed_columns = ['Buurt', 'Stadsdeel_code', 'LNG', 'LAT']
amsterdam_df = amsterdam_data.loc[:, needed_columns]
amsterdam_df.head()

|   | Buurt | Stadsdeel_code | LNG | LAT |
|---|---|---|---|---|
| 0 | Calandlaan/Lelylaan | F | 4.809697 | 52.355708 |
| 1 | Osdorp Zuidoost | F | 4.811344 | 52.353736 |
| 2 | Osdorp Midden Noord | F | 4.791792 | 52.362078 |
| 3 | Osdorp Midden Zuid | F | 4.793781 | 52.358838 |
| 4 | Zuidwestkwadrant Osdorp Noord | F | 4.795597 | 52.355523 |


We will add the district names and rename the columns. Finally, we reorder the columns, and the Amsterdam dataset is ready to use.

In [6]:
amsterdam_districts = {'B':'Westpoort', 'T':'Zuidoost', 'M':'Oost', 'A':'Centrum', 
                       'N':'Noord', 'F':'Nieuw-West', 'E':'West', 'K':'Zuid'}
amsterdam_df.replace({'Stadsdeel_code': amsterdam_districts}, inplace=True)

proper_column_names = {'Buurt':'Neighbourhood', 'Stadsdeel_code':'District', 'LNG':'Longitude', 'LAT':'Latitude'}
amsterdam_df.rename(columns=proper_column_names, inplace=True)

column_order = ['District', 'Neighbourhood', 'Latitude', 'Longitude']
amsterdam_df = amsterdam_df[column_order]
amsterdam_df.head()

|   | District | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Nieuw-West | Calandlaan/Lelylaan | 52.355708 | 4.809697 |
| 1 | Nieuw-West | Osdorp Zuidoost | 52.353736 | 4.811344 |
| 2 | Nieuw-West | Osdorp Midden Noord | 52.362078 | 4.791792 |
| 3 | Nieuw-West | Osdorp Midden Zuid | 52.358838 | 4.793781 |
| 4 | Nieuw-West | Zuidwestkwadrant Osdorp Noord | 52.355523 | 4.795597 |
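
The selection, relabelling and reordering steps above could also be expressed as one chained expression that returns a new frame instead of modifying one in place. This is only a sketch: `prepare_neighbourhoods` is a hypothetical helper name, and it uses `map` rather than `replace`, which means a district code missing from the dictionary becomes NaN instead of being left unchanged.

```python
import pandas as pd

def prepare_neighbourhoods(raw, district_names):
    """Select, relabel and reorder the columns used in the analysis.

    Hypothetical helper for illustration; not part of the notebook.
    """
    return (
        raw.loc[:, ['Stadsdeel_code', 'Buurt', 'LAT', 'LNG']]
           # translate district codes into names (unknown codes become NaN)
           .assign(Stadsdeel_code=lambda d: d['Stadsdeel_code'].map(district_names))
           .rename(columns={'Stadsdeel_code': 'District', 'Buurt': 'Neighbourhood',
                            'LAT': 'Latitude', 'LNG': 'Longitude'})
    )

# One row copied from the output above, plus a column that should be dropped.
raw = pd.DataFrame({
    'Buurt': ['Calandlaan/Lelylaan'], 'Stadsdeel_code': ['F'],
    'LNG': [4.809697], 'LAT': [52.355708], 'Opp_m2': [275360.0],
})
tidy = prepare_neighbourhoods(raw, {'F': 'Nieuw-West'})
print(tidy)
```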


### Istanbul Dataset Preparation

In [7]:
url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response = requests.get(url)
document = BeautifulSoup(response.content, 'html.parser')

We get our target div element.

In [8]:
target_div = document.select('div[class="mw-parser-output"]')[0]

Istanbul has 39 districts and 783 neighbourhoods in total.

In [9]:
len(target_div.select('h3'))

39

In [10]:
len(target_div.select('ol li'))

783

Every h3 element contains a district name, and each li element inside an ol tag represents a neighbourhood. We write a function to extract these values from the target.

In [11]:
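def get_values(target):
    """Collect (district, neighbourhood) pairs from the parsed page."""
    column_names = ['District', 'Neighbourhood']
    df = pd.DataFrame(columns=column_names)
    ix = 0
    # the n-th h3 heading (district) is paired with the n-th ol list (its neighbourhoods)
    for h3, ol in tqdm(zip(target.select('h3'), target.select('ol'))):
        for li in ol.select('li'):
            try:
                neigh = li.a.string  # linked neighbourhood name
            except AttributeError:
                neigh = li.string    # the li has no <a> child, so use its own text
            dist = h3.select('.mw-headline')[0].string
            df.loc[ix] = [dist, neigh] 
            ix += 1
    return df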
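
Assigning one row at a time with `df.loc[ix] = ...` makes pandas grow the frame on every iteration. A possible alternative is to collect plain tuples and build the frame once. The sketch below assumes the scraped values have already been gathered as (district, neighbourhoods) pairs; `rows_to_frame` and the shape of `pairs` are assumptions made for this example.

```python
import pandas as pd

def rows_to_frame(pairs):
    """Flatten (district, [neighbourhoods]) pairs into a two-column frame.

    Hypothetical helper; the input shape is an assumption of this sketch.
    """
    rows = [(district, neigh) for district, neighs in pairs for neigh in neighs]
    return pd.DataFrame(rows, columns=['District', 'Neighbourhood'])

# A few values taken from the outputs in this notebook.
pairs = [('Adalar', ['Burgazada', 'Heybeliada']), ('Fatih', ['Zeyrek, Fatih'])]
frame = rows_to_frame(pairs)
print(frame)
```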

In [12]:
istanbul_data = get_values(target_div)
istanbul_data.head()





|   | District | Neighbourhood |
|---|---|---|
| 0 | Adalar | Burgazada |
| 1 | Adalar | Heybeliada |
| 2 | Adalar | Kınalıada |
| 3 | Adalar | Maden |
| 4 | Adalar | Nizam |


Are there any NaN values in the dataframe?

In [13]:
istanbul_data.isna().any()

District         False
Neighbourhood    False
dtype: bool

Looking at the Neighbourhood column, we see that some entries contain both the neighbourhood and the district name, separated by a comma.
We will strip these redundant district names.
We should also replace the Turkish-specific characters in the 'District' column with ASCII equivalents; otherwise the choropleth maps may not be rendered properly.

In [14]:
istanbul_data.loc[395:400]

|   | District | Neighbourhood |
|---|---|---|
| 395 | Fatih | Yavuz Sultan Selim, Fatih |
| 396 | Fatih | Yedikule, Fatih |
| 397 | Fatih | Zeyrek, Fatih |
| 398 | Gaziosmanpaşa | Bağlarbaşı |
| 399 | Gaziosmanpaşa | Barbaros Hayrettin Paşa |
| 400 | Gaziosmanpaşa | Fevzi Çakmak |


In [15]:
istanbul_data.District.unique()

array(['Adalar', 'Arnavutköy', 'Ataşehir', 'Avcılar', 'Bağcılar',
       'Bahçelievler', 'Bakırköy', 'Başakşehir', 'Bayrampaşa', 'Beşiktaş',
       'Beykoz', 'Beylikdüzü', 'Beyoğlu', 'Büyükçekmece', 'Çatalca',
       'Çekmeköy', 'Esenler', 'Esenyurt', 'Eyüp', 'Fatih',
       'Gaziosmanpaşa', 'Güngören', 'Kadıköy', 'Kağıthane', 'Kartal',
       'Küçükçekmece', 'Maltepe', 'Pendik', 'Sancaktepe', 'Sarıyer',
       'Silivri', 'Sultanbeyli', 'Sultangazi', 'Şile', 'Şişli', 'Tuzla',
       'Ümraniye', 'Üsküdar', 'Zeytinburnu'], dtype=object)

In [16]:
destination = istanbul_data.loc[istanbul_data['Neighbourhood'].str.contains(','), 'Neighbourhood']
istanbul_data.loc[istanbul_data['Neighbourhood'].str.contains(','), 'Neighbourhood'] = destination.apply(lambda neigh: neigh.split(',')[0])
istanbul_districts = {'Arnavutköy':'Arnavutkoy', 'Ataşehir':'Atasehir', 'Avcılar':'Avcilar', 'Bağcılar': 'Bagcilar', 'Bahçelievler':'Bahcelievler',
                      'Bakırköy':'Bakirkoy', 'Başakşehir':'Basaksehir', 'Bayrampaşa':'Bayrampasa', 'Beşiktaş': 'Besiktas', 'Beylikdüzü':'Beylikduzu', 
                      'Beyoğlu':'Beyoglu', 'Büyükçekmece': 'Buyukcekmece', 'Çatalca':'Catalca', 'Çekmeköy':'Cekmekoy', 'Eyüp':'Eyup', 
                      'Gaziosmanpaşa':'Gaziosmanpasa', 'Güngören':'Gungoren', 'Kadıköy':'Kadikoy', 'Kağıthane':'Kagithane', 'Küçükçekmece':'Kucukcekmece', 
                      'Sarıyer':'Sariyer', 'Şile':'Sile', 'Şişli':'Sisli', 'Ümraniye':'Umraniye', 'Üsküdar':'Uskudar'}
istanbul_data.replace({'District': istanbul_districts}, inplace=True)
istanbul_data.loc[395:400]

|   | District | Neighbourhood |
|---|---|---|
| 395 | Fatih | Yavuz Sultan Selim |
| 396 | Fatih | Yedikule |
| 397 | Fatih | Zeyrek |
| 398 | Gaziosmanpasa | Bağlarbaşı |
| 399 | Gaziosmanpasa | Barbaros Hayrettin Paşa |
| 400 | Gaziosmanpasa | Fevzi Çakmak |
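
The hand-written district dictionary works, but the same substitution can be expressed as a character-level translation table, which avoids listing every district. The following standard-library sketch is one way to do it; `TURKISH_TO_ASCII`, `clean_neighbourhood` and `to_ascii` are names introduced here for illustration.

```python
# Map Turkish-specific letters to their closest ASCII counterparts.
TURKISH_TO_ASCII = str.maketrans('çğıöşüÇĞİÖŞÜ', 'cgiosuCGIOSU')

def clean_neighbourhood(name):
    """Keep only the part before the first comma ('Zeyrek, Fatih' -> 'Zeyrek')."""
    return name.split(',')[0].strip()

def to_ascii(name):
    """Replace Turkish-specific letters; all other characters are left untouched."""
    return name.translate(TURKISH_TO_ASCII)

print(clean_neighbourhood('Yedikule, Fatih'))
print(to_ascii('Gaziosmanpaşa'))
print(to_ascii('Şişli'))
```

On a pandas column the same table could be applied with `Series.str.translate(TURKISH_TO_ASCII)`.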


Now we need the coordinates of each neighbourhood. To get them, we will call **Google Maps API geocoding** once for every neighbourhood.

In [17]:
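def get_coordinates(api_key, address, verbose=False):
    """Return [lat, lon] for an address, or [None, None] if the lookup fails."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json'
        # pass the query through params so that requests URL-encodes the address
        response = requests.get(url, params={'key': api_key, 'address': address}).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # network error, malformed JSON, or no result for this address
        return [None, None]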
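
The address travels in the query string of the geocoding request, where spaces, commas and non-ASCII letters have to be percent-encoded. The standard-library sketch below shows what that encoding looks like for one of the addresses built in this notebook. `build_geocode_url` and `DUMMY_KEY` are hypothetical names used only here, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

GEOCODE_ENDPOINT = 'https://maps.googleapis.com/maps/api/geocode/json'

def build_geocode_url(api_key, address):
    """Return the request URL with `key` and `address` percent-encoded."""
    return GEOCODE_ENDPOINT + '?' + urlencode({'key': api_key, 'address': address})

url = build_geocode_url('DUMMY_KEY', 'Bağlarbaşı, Gaziosmanpasa, istanbul')
print(url)

# Decoding the query string gives back the original address.
print(parse_qs(urlsplit(url).query)['address'][0])
```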

In [18]:
def get_all_coordinates(df, api_key):
    for index, (neigh, dist) in tqdm(enumerate(zip(df.Neighbourhood, df.District))):
        address = neigh + ', ' + dist + ', istanbul'
        lat, long = get_coordinates(api_key, address)
        df.loc[index, 'Latitude'] = lat
        df.loc[index, 'Longitude'] = long
    return df
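
Because `get_coordinates` returns `[None, None]` for any failure, a temporary network problem looks the same as an address the geocoder does not know. One possible mitigation is to retry a failed lookup a few times before giving up. The sketch below takes the geocoding function as an argument so that it can be tried without an API key; the helper name, the retry count and the delay are all assumptions.

```python
import time

def geocode_with_retry(geocode, address, attempts=3, delay=1.0):
    """Call `geocode(address)` until it yields coordinates or attempts run out.

    `geocode` is any callable returning [lat, lon] or [None, None].
    Hypothetical helper; attempts and delay are illustrative defaults.
    """
    for attempt in range(attempts):
        lat, lon = geocode(address)
        if lat is not None and lon is not None:
            return [lat, lon]
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before trying again
    return [None, None]

# Fake geocoder that fails once and then succeeds, to exercise the retry path.
calls = []
def flaky_geocode(address):
    calls.append(address)
    return [None, None] if len(calls) < 2 else [41.0, 29.0]

print(geocode_with_retry(flaky_geocode, 'Zeyrek, Fatih, istanbul', delay=0))
```

In the notebook the callable could be `lambda a: get_coordinates(API_KEY, a)`.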

The API_KEY value was removed after the locations were retrieved.

In [19]:
API_KEY = "???"

In [20]:
istanbul_df = get_all_coordinates(istanbul_data, API_KEY)

We will save our dataframe as a CSV file so that it is available for later use.

In [21]:
istanbul_df.to_csv('datasets/istanbul.csv', index=False)
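
Since the API key is no longer in the notebook, a rerun would have to read the saved file rather than geocode again. A small "load if present, otherwise build and save" helper is one way to make that explicit. `load_or_build` is a hypothetical name, and `build` stands for any function that returns the geocoded DataFrame.

```python
from pathlib import Path
import pandas as pd

def load_or_build(path, build):
    """Read the CSV at `path` if it exists; otherwise call `build()` and save it.

    Hypothetical helper for illustration; not part of the notebook.
    """
    path = Path(path)
    if path.exists():
        return pd.read_csv(path)
    df = build()
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    df.to_csv(path, index=False)
    return df
```

A call such as `load_or_build('datasets/istanbul.csv', lambda: get_all_coordinates(istanbul_data, API_KEY))` would then contact the API only when the file is missing.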

In [22]:
istanbul_df.head()

|   | District | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Adalar | Burgazada | 40.88 | 29.066944 |
| 1 | Adalar | Heybeliada | 40.877974 | 29.095299 |
| 2 | Adalar | Kınalıada | 40.90907 | 29.053205 |
| 3 | Adalar | Maden | 40.85832 | 29.123072 |
| 4 | Adalar | Nizam | 40.863169 | 29.116381 |


Were we able to get all neighbourhood locations? What is the total number of NaN values in each column?

In [23]:
istanbul_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 4 columns):
District         783 non-null object
Neighbourhood    783 non-null object
Latitude         730 non-null float64
Longitude        730 non-null float64
dtypes: float64(2), object(2)
memory usage: 24.5+ KB


In [24]:
istanbul_df.isnull().sum()

District          0
Neighbourhood     0
Latitude         53
Longitude        53
dtype: int64

We will drop the rows that contain NaN values and continue the analysis with the rest.

In [25]:
istanbul_df.dropna(axis=0, inplace=True)
istanbul_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 730 entries, 0 to 782
Data columns (total 4 columns):
District         730 non-null object
Neighbourhood    730 non-null object
Latitude         730 non-null float64
Longitude        730 non-null float64
dtypes: float64(2), object(2)
memory usage: 28.5+ KB


Some neighbourhoods are very close to each other, so some of them may have been given the same latitude and longitude. Let's check whether any neighbourhoods share coordinates.

In [26]:
istanbul_df[istanbul_df.duplicated(['Latitude', 'Longitude'],keep=False)].shape

(26, 4)

In [27]:
istanbul_df.drop_duplicates(subset=['Latitude', 'Longitude'], inplace=True)
istanbul_df.shape

(713, 4)
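
The two numbers above measure different things. `duplicated(..., keep=False)` flags every row that shares its coordinates with another row (26 rows), while `drop_duplicates` keeps the first row of each group and removes the rest (730 - 713 = 17 rows). Together they imply 26 - 17 = 9 distinct shared points. The sketch below reproduces this relationship on invented coordinates and also lists which neighbourhoods share a point.

```python
import pandas as pd

# Invented coordinates: rows A-C share one point, rows D-E share another.
toy = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Latitude':  [41.0, 41.0, 41.0, 40.9, 40.9, 40.8],
    'Longitude': [29.0, 29.0, 29.0, 28.9, 28.9, 28.8],
})
coords = ['Latitude', 'Longitude']

in_shared_group = toy.duplicated(coords, keep=False).sum()  # every member of a group
removed = toy.duplicated(coords).sum()                      # all but the first member
remaining = len(toy.drop_duplicates(subset=coords))

# Neighbourhood names grouped by the point they share.
shared = (toy[toy.duplicated(coords, keep=False)]
          .groupby(coords)['Neighbourhood'].agg(list))

print(in_shared_group, removed, remaining)
print(shared)
```

Inspecting the equivalent of `shared` on the real data shows which neighbourhood names are lost when only the first row of each group is kept.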

### Visualisation of Datasets

In [28]:
lat_amsterdam, long_amsterdam = get_coordinates(API_KEY, 'Amsterdam, Netherlands')
lat_istanbul, long_istanbul = get_coordinates(API_KEY, 'Istanbul, Turkey')

In [29]:
print('The geographical coordinates of Amsterdam are {}, {}.'.format(lat_amsterdam, long_amsterdam))

The geographical coordinates of Amsterdam are 52.368, 4.9036.


In [30]:
map_amsterdam = folium.Map(location=[lat_amsterdam, long_amsterdam], zoom_start=12)
for lat, lng, dist, neigh in zip(amsterdam_df['Latitude'], amsterdam_df['Longitude'], amsterdam_df['District'], amsterdam_df['Neighbourhood']):
    label = '{}, {}'.format(neigh, dist)
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#5199FF',
        fill=True,
        fill_color='#B7D4FF',
        fill_opacity=0.6
        ).add_to(map_amsterdam)  
    
map_amsterdam

In [31]:
print('The geographical coordinates of Istanbul are {}, {}.'.format(lat_istanbul, long_istanbul))

The geographical coordinates of Istanbul are 41.0082, 28.9784.


In [32]:
map_istanbul = folium.Map(location=[lat_istanbul, long_istanbul], zoom_start=11)
for lat, lng, dist, neigh in zip(istanbul_df['Latitude'], istanbul_df['Longitude'], istanbul_df['District'], istanbul_df['Neighbourhood']):
    label = '{}, {}'.format(neigh, dist)
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#5199FF',
        fill=True,
        fill_color='#B7D4FF',
        fill_opacity=0.6
        ).add_to(map_istanbul)  
    
map_istanbul