# Project - Helsinki, Finland
## Price of living in relation to neighborhood characteristics

# Introduction/Business Problem

Apartment prices in Helsinki, Finland, have been skyrocketing for the last few years as more people move to the capital, but there are still noticeable price differences between neighborhoods.

The research is targeted at newcomers who are moving to Helsinki for work, and aims to give insight into the characteristics of the neighborhoods compared with the price of living there. Visualization will also give an overview of how the price changes with location.

I will use k-means clustering to group and compare neighborhoods by similarities in their venues, and then compare the average price of living (2020 prices) between the clusters to see whether the characteristics of a neighborhood correlate with the price of living, or whether the price depends mainly on geographic distance from the city center.
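
As a rough illustration of the clustering step, the sketch below runs k-means (Lloyd's algorithm) on made-up venue-frequency vectors. The neighborhood names, the venue shares, the hand-picked starting centroids and the choice of two clusters are all illustrative assumptions, not results from this project. It is written in plain Python to keep it self-contained; in the actual analysis a library implementation such as scikit-learn's `KMeans` would be the more usual choice.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical share of venues per category:
# [cafes, restaurants, parks, sport facilities]
venue_profiles = {
    'Neighborhood A': [0.50, 0.40, 0.05, 0.05],
    'Neighborhood B': [0.45, 0.45, 0.05, 0.05],
    'Neighborhood C': [0.05, 0.10, 0.50, 0.35],
    'Neighborhood D': [0.10, 0.05, 0.45, 0.40],
}
names = list(venue_profiles)
points = [venue_profiles[n] for n in names]

# starting centroids are picked by hand here; libraries usually choose them randomly
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
clusters = dict(zip(names, labels))
print(clusters)
```

With a cluster label per neighborhood, the average price of each cluster can then be compared, for example with a pandas `groupby` on the label column.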



# Data and analysis


The venue data will be extracted using the Foursquare API. The venue data is not specific; it gives an overview of neighborhood characteristics by showing whether the venues are, for example:

- Cafés
- Sport facilities
- Restaurants
- Parks
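
To make the idea of broad venue categories concrete, the sketch below pulls the category name out of every venue in a response and counts how often each category occurs. The nested layout (`response`, `groups`, `items`, `venue`, `categories`) mirrors what the legacy Foursquare v2 `venues/explore` endpoint returned as far as I am aware, but treat the field names as an assumption. The payload is made up and no request is sent; `extract_categories` is a hypothetical helper name.

```python
from collections import Counter

def extract_categories(response_json):
    """Return the first category name of every venue in an explore-style response."""
    # assumed layout: response -> groups -> items -> venue -> categories -> name
    categories = []
    for group in response_json.get('response', {}).get('groups', []):
        for item in group.get('items', []):
            venue_categories = item.get('venue', {}).get('categories', [])
            if venue_categories:  # skip venues that carry no category
                categories.append(venue_categories[0]['name'])
    return categories


# made-up payload standing in for one neighborhood's API response
sample_response = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Venue 1', 'categories': [{'name': 'Café'}]}},
        {'venue': {'name': 'Venue 2', 'categories': [{'name': 'Park'}]}},
        {'venue': {'name': 'Venue 3', 'categories': [{'name': 'Café'}]}},
        {'venue': {'name': 'Venue 4', 'categories': []}},
    ]}]}
}

category_counts = Counter(extract_categories(sample_response))
print(category_counts.most_common())
```

Counts like these, normalized per neighborhood, are the kind of feature vector that the clustering can work on.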

The neighborhood data (postcode, name and price) is extracted from the following website by web scraping with the BeautifulSoup library:

https://blok.ai/en/neighbourhoods/

The Geopy library allows me to extract the coordinates for each neighborhood as well. I will then use the folium library to plot the neighborhoods and their clusters. The visualization of price on the map itself is optional as I have been unable to find data with the exact boundaries (coordinates) of each neighborhood, but I am still aiming towards finding the data.

In the code below you can see the initial extraction of neighborhood, price and coordinate data (use, for example, https://nbviewer.jupyter.org/ to view the map).

In [1]:
#imports

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np # library to handle data in a vectorized manner

from decouple import config # for API 

import requests # library to handle requests

!pip install folium 
import folium # import folium, map rendering library

!pip install bs4 
from bs4 import BeautifulSoup # import BeautifulSoup for webscraping

import geocoder # import geocoder for coordinates
from geopy.geocoders import Nominatim # geocoder used to look up coordinates



# Webscraping

Extracting housing price data of districts in Finland

In [2]:
website_url = requests.get('https://blok.ai/en/neighbourhoods/')
soup = BeautifulSoup(website_url.content,'html.parser')
# print(soup.prettify()) # uncomment to see raw html data

### Saving relevant headers for our research (Postcode, Neighborhood, Average price per square 2020). City column is also extracted for further cleaning purposes

In [3]:
headers = [c.get_text() for c in soup.find('thead').find_all('th')[2:6]]
headers

['Postcode', 'Neighborhood', 'City', 'Average price per square 2020']

### Extracting relevant data and saving it into a dataframe using headers as columns

In [4]:
table = soup.find('tbody').find_all('tr')
table_contents = [[cell.get_text(strip=True) for cell in row.find_all('td')[2:6]]
        for row in table]

df=pd.DataFrame(table_contents, columns=headers)

Displaying first five entries

In [5]:
df.head()

| | Postcode | Neighborhood | City | Average price per square 2020 |
|---|---|---|---|---|
| 0 | 140 | Kaivopuisto - Ullanlinna | Helsinki | 8713 |
| 1 | 150 | Eira - Hernesaari | Helsinki | 8367 |
| 2 | 120 | Punavuori | Helsinki | 8160 |
| 3 | 180 | Kamppi - Ruoholahti | Helsinki | 8023 |
| 4 | 220 | Jätkäsaari | Helsinki | 7871 |


Displaying shape

In [6]:
df.shape

(846, 4)

## Cleaning data

Since the dataframe contains all districts in Finland and we only want districts located in Helsinki, we will remove the other entries and the City column.

In [7]:
# .copy() makes an independent dataframe, which avoids pandas' SettingWithCopyWarning
helsinki_data = df.loc[df['City'] == 'Helsinki'].copy()
helsinki_data.drop(columns=['City'], inplace=True)
helsinki_data.reset_index(drop=True, inplace=True)
helsinki_data


| | Postcode | Neighborhood | Average price per square 2020 |
|---|---|---|---|
| 0 | 140 | Kaivopuisto - Ullanlinna | 8713 |
| 1 | 150 | Eira - Hernesaari | 8367 |
| 2 | 120 | Punavuori | 8160 |
| 3 | 180 | Kamppi - Ruoholahti | 8023 |
| 4 | 220 | Jätkäsaari | 7871 |
| 5 | 170 | Kruununhaka | 7841 |
| 6 | 130 | Kaartinkaupunki | 7825 |
| 7 | 580 | Verkkosaari | 7609 |
| 8 | 100 | Helsinki Keskusta - Etu-Töölö | 7575 |
| 9 | 260 | Keski-Töölö | 7384 |


Shape of new dataframe

In [8]:
helsinki_data.shape

(78, 3)

Adding coordinates

In [9]:
column_names = ['Postcode', 'Neighborhood', 'Average price', 'Latitude', 'Longitude']

# creating the geolocator once and reusing it for every lookup
geolocator = Nominatim(user_agent=config('USER_AGENT'))

rows = []

# iterating through helsinki df (This may take a minute)
for i in range(len(helsinki_data)):

    postalCode = helsinki_data.iloc[i]['Postcode']
    neighborhood_name = helsinki_data.iloc[i]['Neighborhood']
    price = helsinki_data.iloc[i]['Average price per square 2020']

    # getting coordinates for each neighborhood
    loc = geolocator.geocode('{}, Helsinki, Finland'.format(postalCode))
    latitude = loc.latitude
    longitude = loc.longitude

    # collecting the row; DataFrame.append was removed in pandas 2.0
    rows.append({'Postcode': postalCode,
                 'Neighborhood': neighborhood_name,
                 'Average price': price,
                 'Latitude': latitude,
                 'Longitude': longitude})

# building the new df from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)
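
A lookup like this can fail quietly: geopy's `geocode` returns `None` when a query has no match, and the public Nominatim service asks clients to stay at or below one request per second. The sketch below wraps any geocoding callable so that a miss yields NaN instead of an exception. The helper name `lookup_coordinates` and the stand-in geocoder are hypothetical, and are only there so that the example runs without network access.

```python
import math
from collections import namedtuple

def lookup_coordinates(geocode, postcode, city='Helsinki', country='Finland'):
    """Return (latitude, longitude), or (nan, nan) when the lookup finds nothing."""
    location = geocode('{}, {}, {}'.format(postcode, city, country))
    if location is None:
        return math.nan, math.nan
    return location.latitude, location.longitude


# stand-in for geolocator.geocode, so that no request is sent
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def fake_geocode(query):
    known = {'140, Helsinki, Finland': FakeLocation(60.157935, 24.952702)}
    return known.get(query)  # None for anything unknown, like geopy


found = lookup_coordinates(fake_geocode, '140')
missing = lookup_coordinates(fake_geocode, '99999')
print(found, missing)
```

With geopy, the `geocode` argument could be `RateLimiter(geolocator.geocode, min_delay_seconds=1)` from `geopy.extra.rate_limiter`, which spaces out the requests.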


Manually correcting some of the coordinates that were completely wrong

In [10]:
# corrected (latitude, longitude) per neighborhood
corrections = {
    'Herttoniemi': (60.195415, 25.033302),
    'Itä-Pasila': (60.200041, 24.939573),
    'Veräjämäki': (60.227290, 24.972098),
    'Puistola': (60.271337, 25.045611),
    'Suurmetsä': (60.265790, 25.079472),
    'Maununneva': (60.244833, 24.898528),
    'Kivihaka': (60.210680, 24.903876),
}

# .loc with a boolean mask is used because .at is meant for single scalar cells
for name, (lat, lng) in corrections.items():
    neighborhoods.loc[neighborhoods['Neighborhood'] == name, ['Latitude', 'Longitude']] = [lat, lng]
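
One way to spot coordinates that are far off, rather than finding them by eye, would be to flag every neighborhood whose geocoded point lies implausibly far from the city center. The sketch below uses the haversine formula for the distance. The center coordinates (roughly the Helsinki city center), the 25 km threshold and the "Far away" row are assumptions chosen for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


CENTER = (60.1699, 24.9384)  # assumed approximate city-center coordinates
MAX_DISTANCE_KM = 25         # assumed threshold for a plausible Helsinki location

points = {
    'Kaivopuisto - Ullanlinna': (60.157935, 24.952702),  # geocoded value from this notebook
    'Far away (made up)': (61.5, 23.8),
}

# names of all points that lie farther from the center than the threshold
suspicious = [name for name, (lat, lon) in points.items()
              if haversine_km(lat, lon, *CENTER) > MAX_DISTANCE_KM]
print(suspicious)
```

The same distance could later serve as the "distance from the city center" variable when comparing it with price.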
                                       

Check first five entries on new dataframe

In [11]:
neighborhoods.head()

| | Postcode | Neighborhood | Average price | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | 140 | Kaivopuisto - Ullanlinna | 8713 | 60.157935 | 24.952702 |
| 1 | 150 | Eira - Hernesaari | 8367 | 60.158939 | 24.938014 |
| 2 | 120 | Punavuori | 8160 | 60.163562 | 24.939202 |
| 3 | 180 | Kamppi - Ruoholahti | 8023 | 60.163576 | 24.917557 |
| 4 | 220 | Jätkäsaari | 7871 | 60.157433 | 24.917381 |


Check shape to see if still consistent

In [12]:
neighborhoods.shape

(78, 5)

In [13]:
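# Creating a map of Helsinki centered on the mean of the neighborhood coordinates
map_center = [neighborhoods['Latitude'].astype(float).mean(), neighborhoods['Longitude'].astype(float).mean()]
helsinki_map = folium.Map(location=map_center, zoom_start=11)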

# Adding markers to map
for lat, lng, label in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(helsinki_map)  
    
helsinki_map
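
Without boundary data, one option for still showing price on the map would be to color each marker by its price. The sketch below maps a price linearly onto a blue-to-red hex color that could be passed as `color` and `fill_color` to `folium.CircleMarker`. The function name `price_to_hex` and the color ramp are illustrative choices of mine; the range used is simply the highest and lowest price among the ten Helsinki rows shown earlier, not the full dataset.

```python
def price_to_hex(price, low, high):
    """Map a price in [low, high] to a hex color from blue (low) to red (high)."""
    if high == low:
        share = 0.0
    else:
        # clamp so that values outside the range still give a valid color
        share = min(max((price - low) / (high - low), 0.0), 1.0)
    red = round(255 * share)
    blue = round(255 * (1 - share))
    return '#{:02x}00{:02x}'.format(red, blue)


low, high = 7384, 8713  # lowest and highest price among the rows shown earlier
colors = {price: price_to_hex(price, low, high) for price in (7384, 8023, 8713)}
print(colors)
```

Because the scraped prices are text, they would need converting to numbers (for example with `pd.to_numeric`) before being passed to a function like this.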