# APPLIED DATA SCIENCE CAPSTONE PROJECT
## The Battle of Neighbourhoods

In this project I analyze the neighbourhoods of Basel City, Switzerland, and cluster them with the k-means algorithm to identify the ones that would best suit my preferences for a move.

Questions to be addressed:

1. What are the features I am looking for in the neighbourhood?
2. What kind of data is required?
3. Where to collect the data from? 
4. How to collect the data?


### 1. Import all necessary libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a flat pandas dataframe (pandas >= 1.0; the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# k-means for the clustering stage
from sklearn.cluster import KMeans

from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


### 2. Collecting Basel data
This step scrapes the relevant web page for Basel district data and builds a dataframe with the postal codes and names of the city's quarters (Quartiere).

In [2]:
url = 'https://www.plz-suche.org/basel-ch7874'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')

In [3]:
# finding the right table
table = soup.find('table', {'class': 'list-location tablesorter tablesorter-location'})

In [4]:
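# scraping the table: one list of cell texts per table row
table_contents = []

for row in table.find_all('tr'):
    # keep only header (th) and data (td) cells, so bare text nodes
    # between the tags no longer need a try/except to be skipped
    cells = [cell.get_text().replace('\n', '')
             for cell in row.find_all(['th', 'td'], recursive=False)]

    if cells:
        table_contents.append(cells)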
    
print(table_contents)

[['PLZ', 'Name', 'Typ', '\xa0'], ['4001-4051', 'Altstadt Grossbasel', 'Quartier', ''], ['4058', 'Altstadt Kleinbasel', 'Quartier', ''], ['4051-4056', 'Am Ring', 'Quartier', ''], ['4054', 'Bachletten', 'Quartier', ''], ['4052', 'Breite', 'Quartier', ''], ['4059', 'Bruderholz', 'Quartier', ''], ['4058', 'Clara', 'Quartier', ''], ['4054', 'Gotthelf', 'Quartier', ''], ['4053', 'Gundeldingen', 'Quartier', ''], ['4058', 'Hirzbrunnen', 'Quartier', ''], ['4055', 'Iselin', 'Quartier', ''], ['4057', 'Kleinhüningen', 'Quartier', ''], ['4057', 'Klybeck', 'Quartier', ''], ['4057', 'Matthäus', 'Quartier', ''], ['4058', 'Rosental', 'Quartier', ''], ['4052', 'Sankt Alban', 'Quartier', ''], ['4056', 'Sankt Johann', 'Quartier', ''], ['4051', 'Vorstädte', 'Quartier', ''], ['4058', 'Wettstein', 'Quartier', '']]


In [5]:
# creating a dataframe
df = pd.DataFrame(table_contents)
print(df.head(10))

           0                    1         2  3
0        PLZ                 Name       Typ   
1  4001-4051  Altstadt Grossbasel  Quartier   
2       4058  Altstadt Kleinbasel  Quartier   
3  4051-4056              Am Ring  Quartier   
4       4054           Bachletten  Quartier   
5       4052               Breite  Quartier   
6       4059           Bruderholz  Quartier   
7       4058                Clara  Quartier   
8       4054             Gotthelf  Quartier   
9       4053         Gundeldingen  Quartier   


In [6]:
# cleaning up the data
df.drop([2, 3], axis = 1, inplace = True)
df.drop(0, axis = 0, inplace = True)
df.columns = ['Postal Code', 'Quartiere']
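The positional drops above depend on the column order of the scraped table. As an illustration only, the same cleanup can be done on the raw rows before the dataframe is built, using the header row to pick columns by label. The helper name `rows_to_records` and the shortened `sample_rows` list are hypothetical; the header labels `PLZ` and `Name` are taken from the scraped output.

```python
def rows_to_records(rows, keep):
    """Turn scraped table rows (header row first) into a list of dicts.

    `keep` maps a header label in the scraped table to the output column name.
    """
    header, *body = rows
    # Positions of the wanted columns, looked up by header label.
    positions = {new: header.index(old) for old, new in keep.items()}
    return [
        {new: row[pos].strip() for new, pos in positions.items()}
        for row in body
    ]


# Hypothetical sample in the same shape as `table_contents`.
sample_rows = [
    ['PLZ', 'Name', 'Typ', '\xa0'],
    ['4054', 'Bachletten', 'Quartier', ''],
    ['4052', 'Breite', 'Quartier', ''],
]

records = rows_to_records(sample_rows, {'PLZ': 'Postal Code', 'Name': 'Quartiere'})
print(records)
```

Passing `records` to `pd.DataFrame` would give the two named columns with a zero-based index, rather than the 1-based index left behind by dropping row 0.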

In [7]:
print(df)
print('There are {} quartieres in Basel City.'.format(df.shape[0]))

   Postal Code            Quartiere
1    4001-4051  Altstadt Grossbasel
2         4058  Altstadt Kleinbasel
3    4051-4056              Am Ring
4         4054           Bachletten
5         4052               Breite
6         4059           Bruderholz
7         4058                Clara
8         4054             Gotthelf
9         4053         Gundeldingen
10        4058          Hirzbrunnen
11        4055               Iselin
12        4057        Kleinhüningen
13        4057              Klybeck
14        4057             Matthäus
15        4058             Rosental
16        4052          Sankt Alban
17        4056         Sankt Johann
18        4051            Vorstädte
19        4058            Wettstein
There are 19 quartieres in Basel City.


The postal codes are not unique: a single code can cover several neighbouring quarters, and some quarters are listed with a range of codes.
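To make the overlap concrete, here is a minimal sketch that lists the quarters matching a given postal code. It assumes that a hyphenated entry such as `4001-4051` denotes an inclusive range of codes, which the source page does not confirm. The helper names `parse_plz` and `quarters_for_code` are hypothetical, and `entries` holds only a few rows copied from the table above.

```python
def parse_plz(text):
    """Parse '4058' or '4001-4051' into an inclusive (low, high) pair.

    Assumption: a hyphen separates the first and last code of a range.
    """
    low, _, high = text.partition('-')
    return int(low), int(high or low)


def quarters_for_code(code, entries):
    """Return the quarters whose postal-code range contains `code`."""
    matches = []
    for plz, name in entries:
        low, high = parse_plz(plz)
        if low <= code <= high:
            matches.append(name)
    return matches


# A few (postal code, quarter) pairs copied from the table above.
entries = [
    ('4001-4051', 'Altstadt Grossbasel'),
    ('4051-4056', 'Am Ring'),
    ('4054', 'Bachletten'),
    ('4054', 'Gotthelf'),
    ('4051', 'Vorstädte'),
]

print(quarters_for_code(4054, entries))
print(quarters_for_code(4051, entries))
```

Under that assumption, both 4054 and 4051 match three of the sampled quarters, which is why the postal code alone cannot serve as a key for a quarter.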

Let's collect the latitude and longitude of each quarter. These coordinates are needed later to look up venue information for each district.

In [13]:
#!conda install -c conda-forge geopy --yes # uncomment if needed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [30]:
# collecting the geospatial data for the quartieres
geolocator = Nominatim(user_agent="ny_explorer")
coords = []

for quartiere in df['Quartiere']:
    address = quartiere + ', Basel, Switzerland'
    location = geolocator.geocode(address)
    if location is None:
        # geocode() returns None when nothing matches; skip the quartiere
        # and let the left merge below leave its coordinates as NaN
        continue
    coords.append({
        'Quartiere': quartiere,
        'Latitude': location.latitude,
        'Longitude': location.longitude,
    })
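The lookup loop can also be wrapped in a small function that records which names could not be geocoded and pauses between requests. This is a sketch, not the notebook's method: `geocode_quarters` and `fake_geocode` are hypothetical names, and the geocoder is passed in as a plain callable so that the example runs without network access. The coordinates for Clara are copied from the notebook output.

```python
import time
from types import SimpleNamespace


def geocode_quarters(names, geocode, suffix=', Basel, Switzerland', delay=1.0):
    """Geocode each name and return (rows, missing).

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` and `.longitude`, or None when nothing is found.
    """
    rows, missing = [], []
    for i, name in enumerate(names):
        if i and delay:
            time.sleep(delay)  # pause between consecutive requests
        location = geocode(name + suffix)
        if location is None:
            missing.append(name)  # remember the failure instead of crashing
            continue
        rows.append({
            'Quartiere': name,
            'Latitude': location.latitude,
            'Longitude': location.longitude,
        })
    return rows, missing


def fake_geocode(address):
    """Stand-in geocoder for the demo; knows a single address."""
    known = {
        'Clara, Basel, Switzerland':
            SimpleNamespace(latitude=47.564085, longitude=7.596629),
    }
    return known.get(address)


rows, missing = geocode_quarters(['Clara', 'Nowhere'], fake_geocode, delay=0)
print(rows)
print(missing)
```

With geopy, `geolocator.geocode` could be passed as the `geocode` argument. The one-second default delay reflects the usage policy of the public Nominatim service, which asks clients not to exceed one request per second.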

In [31]:
# converting coordinates data into a dataframe
coordinates = pd.DataFrame(coords)

In [32]:
basel_data = pd.merge(df, coordinates, on='Quartiere', how='left')
print(basel_data)

   Postal Code            Quartiere   Latitude  Longitude
0    4001-4051  Altstadt Grossbasel  47.556427   7.588259
1         4058  Altstadt Kleinbasel  47.560700   7.593382
2    4051-4056              Am Ring  47.558774   7.577477
3         4054           Bachletten  47.548566   7.571726
4         4052               Breite  47.551809   7.617853
5         4059           Bruderholz  47.530799   7.591624
6         4058                Clara  47.564085   7.596629
7         4054             Gotthelf  47.555819   7.570952
8         4053         Gundeldingen  47.543219   7.591485
9         4058          Hirzbrunnen  47.568873   7.615470
10        4055               Iselin  47.562196   7.565999
11        4057        Kleinhüningen  47.583376   7.597574
12        4057              Klybeck  47.576798   7.590149
13        4057             Matthäus  47.567439   7.591540
14        4058             Rosental  47.567708   7.601491
15        4052          Sankt Alban  47.549565   7.605052
16        4056