# Introduction
This notebook series explores Booli data, i.e. housing market data. The focus is on Stockholm's inner city plus Gröndal, since that area is of special interest. This first notebook only covers collecting the data and cleaning it into a usable form.

In [240]:
# Do the imports
import matplotlib.pyplot as plt
%matplotlib inline
import http.client
from urllib.parse import urlencode, quote
import time
import datetime
from hashlib import sha1
import random
import string
import os
import sys
import urllib as ul
import json
import numpy as np
import seaborn as sns
import pandas as pd
from geopy.geocoders import Nominatim
from IPython.display import display, HTML, Image
Image(url='https://bcdn.se/images/resources/booli_logo.png')

# Data Collection/Preparation

### Set variables

In [241]:
district =      ['Stockholm innerstad','Gröndal']
startDate =     '2016-01-01'
endDate =       datetime.datetime.now().strftime('%Y-%m-%d')
callerId =      'caller'
privateKey =    'key'
#minLivingArea = 50
#maxLivingArea = 70

In [242]:
# The Booli API requires authentication: a SHA-1 hash of callerId + time + privateKey + unique
timestamp = str(int(time.time()))
unique = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))
hashstr = sha1((callerId+timestamp+privateKey+unique).encode('utf-8')).hexdigest()
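
The hash recipe in the cell above can be wrapped in a small helper so that the four query parameters are always built together. This is only a sketch: the function name `booli_auth_params` and its `now`/`unique` overrides are hypothetical and exist just to make the result reproducible; the recipe itself (SHA-1 over callerId + time + privateKey + unique) is the one used in this notebook.

```python
import random
import string
import time
from hashlib import sha1
from urllib.parse import urlencode


def booli_auth_params(caller_id, private_key, now=None, unique=None):
    """Return the callerId/time/unique/hash query parameters as a dict.

    `now` and `unique` are hypothetical overrides for reproducible output.
    """
    timestamp = str(int(time.time() if now is None else now))
    if unique is None:
        alphabet = string.ascii_uppercase + string.digits
        unique = ''.join(random.choice(alphabet) for _ in range(16))
    # Same recipe as in the notebook cell: SHA-1 over callerId + time + privateKey + unique.
    digest = sha1((caller_id + timestamp + private_key + unique).encode('utf-8')).hexdigest()
    return {'callerId': caller_id, 'time': timestamp, 'unique': unique, 'hash': digest}


# Fixed inputs so the resulting query string is deterministic.
params = booli_auth_params('caller', 'key', now=1500000000, unique='ABCDEFGH12345678')
query = urlencode(params)
```

The resulting `query` string can be appended to a request URL in place of the hand-concatenated `callerId=...&time=...&unique=...&hash=...` part.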

## Get the data
Open the connection and loop through the districts. Each call returns at most 1000 objects, so an offset is used to fetch the next 1000 objects, and so on.

In [243]:
connection = http.client.HTTPConnection("api.booli.se")
result = []
limit = 1000
for dist in district:
    print('Collect data for: ', dist)
    MO = True
    objects = 0
    offset = 0
    while MO:
        print('limit:', limit, 'offset: ',offset)
        url = ("/sold?q="+quote(dist)+"&"
               "minSoldDate="+startDate+"&"
               "maxSoldDate="+endDate+"&"
               #"minLivingArea="+str(minLivingArea)+"&"
               #"maxLivingArea="+str(maxLivingArea)+"&"
               "limit="+str(limit)+"&"+
               "offset="+str(offset)+"&"
               "callerId="+callerId+"&time="+timestamp+"&unique="+unique+"&hash="+hashstr)
        connection.request("GET", url)
        response = connection.getresponse()
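        if response.status != 200:
            # Stop paging this district: without a new payload, result[-1] below
            # would be missing or belong to an earlier request.
            print("fail")
            response.read()  # drain the body so the connection can be reused
            break
        data = response.read().decode('utf8')
        result.append(json.loads(data))
        print('objects added:', result[-1]['count'])
        objects = objects + limit
        if objects >= result[-1]['totalCount']: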
            MO=False
            print('all objects added: ',len(result), ', totalCount: ',result[-1]['totalCount'])
        else:
            print('adjusting offset')
            offset = offset + limit
            time.sleep(0.5)
connection.close()

Collect data for:  Stockholm innerstad
limit: 1000 offset:  0
objects added: 1000
adjusting offset
limit: 1000 offset:  1000
objects added: 1000
adjusting offset
limit: 1000 offset:  2000
objects added: 1000
adjusting offset
limit: 1000 offset:  3000
objects added: 1000
adjusting offset
limit: 1000 offset:  4000
objects added: 1000
adjusting offset
limit: 1000 offset:  5000
objects added: 1000
adjusting offset
limit: 1000 offset:  6000
objects added: 1000
adjusting offset
limit: 1000 offset:  7000
objects added: 1000
adjusting offset
limit: 1000 offset:  8000
objects added: 356
all objects added:  9 , totalCount:  8356
Collect data for:  Gröndal
limit: 1000 offset:  0
objects added: 172
all objects added:  10 , totalCount:  172
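
The offset-based paging used in this cell can be separated from the HTTP details. The sketch below assumes a hypothetical callable `fetch_page(offset, limit)` that returns a dict shaped like the responses seen here (`sold`, `count`, `totalCount`); a fake in-memory version stands in for the real API call.

```python
def fetch_all(fetch_page, limit=1000):
    """Collect every item from an offset-paged endpoint.

    `fetch_page(offset, limit)` is a hypothetical callable returning a dict
    like {'sold': [...], 'count': n, 'totalCount': N}.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items.extend(page['sold'])
        offset += limit
        # Stop once the next offset is past the end, or if a page comes back empty.
        if offset >= page['totalCount'] or page['count'] == 0:
            break
    return items


# Stand-in for the HTTP call: 2356 fake objects served in pages.
fake_objects = list(range(2356))
calls = []


def fake_fetch_page(offset, limit):
    calls.append(offset)
    chunk = fake_objects[offset:offset + limit]
    return {'sold': chunk, 'count': len(chunk), 'totalCount': len(fake_objects)}


collected = fetch_all(fake_fetch_page)
```

With 2356 objects and a limit of 1000, the fake endpoint is called with offsets 0, 1000 and 2000.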


In [244]:
# Merge all data into one dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
df = pd.concat([pd.DataFrame(res['sold']) for res in result])
df = df.set_index('booliId',drop=False)
df_copy = df.copy()
df.info()
print('\nBooliId is an unique index:',df.index.is_unique)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8528 entries, 2220639 to 2012738
Data columns (total 17 columns):
additionalArea       212 non-null float64
booliId              8528 non-null int64
constructionYear     7691 non-null float64
floor                7868 non-null float64
isNewConstruction    129 non-null float64
listPrice            8461 non-null float64
livingArea           8518 non-null float64
location             8528 non-null object
objectType           8528 non-null object
plotArea             1292 non-null float64
published            8528 non-null object
rent                 8499 non-null float64
rooms                8523 non-null float64
soldDate             8528 non-null object
soldPrice            8528 non-null int64
source               8528 non-null object
url                  8528 non-null object
dtypes: float64(9), int64(2), object(6)
memory usage: 1.2+ MB

BooliId is an unique index: True


A few columns have many null values. Let's investigate those further and see what can be done about them.

# Extract/Clean/Preprocess Data

For the fields additionalArea and isNewConstruction, fill in 0 where the information is missing.

In [245]:
# Assume additionalArea and isNewConstruction are 0 when the info is not provided.
df.loc[:,'additionalArea'] = df.loc[:,'additionalArea'].fillna(0)
df.loc[:,'isNewConstruction'] = df.loc[:,'isNewConstruction'].fillna(0)
# Remove plotArea since the info is sparse and not really interesting
df.drop('plotArea', axis=1, inplace=True)

Look at the location field. What do we have there?

In [246]:
df.loc[:,'location'].iloc[0]

{'address': {'streetAddress': 'Reimersholmsgatan 49'},
 'distance': {'ocean': 2968},
 'namedAreas': ['Södermalm'],
 'position': {'latitude': 59.31718088, 'longitude': 18.022151},
 'region': {'countyName': 'Stockholms län', 'municipalityName': 'Stockholm'}}

A dictionary with some interesting info. Extract it and make it easier to work with.

In [247]:
# Extract info from the nested dictionary
# Extract area and street address
namedAreas = []
streetAddress = []
for i in df.loc[:,'location']:
    try:
        namedAreas.append(i['namedAreas'][0])
    except (KeyError, IndexError, TypeError):
        # Key missing, empty list, or location is not a dict
        namedAreas.append('NULL')
    try:
        streetAddress.append(i['address']['streetAddress'])
    except (KeyError, TypeError):
        streetAddress.append('NULL')
df.loc[:,'namedAreas'] = namedAreas
df.loc[:,'streetAddress'] = streetAddress

In [248]:
# Extract street address name and street address number
import re
streetAddressName = []
streetAddressNumber = []
for s in df.loc[:,'streetAddress']:
    try:
        # Everything before the last whitespace that is followed by a digit
        streetAddressName.append(re.findall(r"(.*)\s\d",s)[0])
    except (IndexError, TypeError):
        streetAddressName.append('NULL')
    try:
        # The digits right after the last whitespace (a letter suffix such as 'A' is dropped)
        streetAddressNumber.append(re.findall(r".*\s(\d*)",s)[0])
    except (IndexError, TypeError):
        streetAddressNumber.append('NULL')
df.loc[:,'streetAddressNumber'] = streetAddressNumber
df.loc[:,'streetAddressName'] = streetAddressName
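
As an alternative to the two separate `findall` calls, a single pattern can split an address into name, number and an optional entrance letter in one pass. This is a sketch only: `ADDRESS_PATTERN` and `split_address` are hypothetical names, and the pattern assumes addresses of the form "street name, whitespace, digits, optional single letter", which matches the examples seen in this notebook but may not cover every Booli address.

```python
import re

# Street name, house number, optional single-letter entrance suffix.
ADDRESS_PATTERN = re.compile(r"(.*\D)\s+(\d+)\s*([A-Za-z]?)")


def split_address(address):
    """Return (name, number, suffix), or None when no house number is found."""
    match = ADDRESS_PATTERN.fullmatch(address)
    if match is None:
        return None
    name, number, suffix = match.groups()
    return name, number, suffix.upper()


examples = ['Reimersholmsgatan 49', 'Rörstrandsgatan 38A', 'Love Almqvists väg 4A', 'NULL']
parsed = [split_address(a) for a in examples]
```

Keeping the suffix separate means "38A" and "38B" can later be treated as the same building or as different entrances, depending on what the analysis needs.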

In [249]:
# Add distance to ocean
ocean = []
for i in df.loc[:,'location']:
    try:
        ocean.append(int(i['distance']['ocean']))
    except (KeyError, TypeError, ValueError):
        ocean.append(np.nan)
df.loc[:,'ocean'] = ocean

In [250]:
# Add coordinates (a dict with latitude and longitude)
coordinates = []
for i in df.loc[:,'location']:
    try:
        coordinates.append(i['position'])
    except (KeyError, TypeError):
        coordinates.append(np.nan)
df.loc[:,'coordinates'] = coordinates
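
The repeated try/except blocks used to pull values out of the nested `location` dictionary all follow the same pattern: walk down a path of keys and fall back to a default if anything is missing. A minimal sketch of that pattern in plain Python follows; the helper name `dig` is hypothetical, and the sample dictionary is the one printed earlier in this notebook.

```python
def dig(record, *keys, default=None):
    """Follow `keys` through nested dicts/lists, returning `default` on any miss."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


location = {
    'address': {'streetAddress': 'Reimersholmsgatan 49'},
    'distance': {'ocean': 2968},
    'namedAreas': ['Södermalm'],
    'position': {'latitude': 59.31718088, 'longitude': 18.022151},
}

area = dig(location, 'namedAreas', 0, default='NULL')
ocean_distance = dig(location, 'distance', 'ocean')
# An empty namedAreas list falls back to the default instead of raising.
missing_area = dig({'namedAreas': []}, 'namedAreas', 0, default='NULL')
```

Such a helper could be applied to the column with something like `df['location'].apply(lambda loc: dig(loc, 'distance', 'ocean'))`, though that usage is an assumption and not part of the original notebook.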

Have a look at the source field.

In [251]:
df.loc[:,'source'].values[0]

{'id': 1130,
 'name': 'Innerstadsspecialisten AB',
 'type': 'Broker',
 'url': 'http://www.innerspec.se/'}

OK, this is info on the broker. Interesting, so extract it.

In [252]:
# Add broker
broker = []
for i in df.loc[:,'source']:
    try:
        broker.append(i['name'])
    except (KeyError, TypeError):
        broker.append('NULL')
df.loc[:,'broker'] = broker

In [253]:
# Do datetime conversions and add some info on the sqm price
# (plain column assignment, so the column dtype becomes datetime64 and .dt works on the next line)
df['soldDate'] = pd.to_datetime(df.loc[:,'soldDate'])
df.loc[:,'soldMonth'] = df.loc[:,'soldDate'].dt.to_period('M')
df.loc[:,'soldPriceSqm'] = df.loc[:,'soldPrice']/df.loc[:,'livingArea']
df.loc[:,'listPriceSqm'] = df.loc[:,'listPrice']/df.loc[:,'livingArea']

Looking at the info on the data, we can see that the constructionYear field has many NULL values. Maybe this info is available on other objects, i.e. several objects sharing the same address? The same address should have the same construction year.

In [254]:
# The same address should have the same construction year, so set it where it's missing.
count = [0,0]
for i in df[df.loc[:,'constructionYear'].isnull()].loc[:,'streetAddress']:
    try:
        # Take the first value on a matching address, even if the same address has different construction years (why is this?)
        new_constructionYear = df[(df.loc[:,'streetAddress']==i) & (df.loc[:,'constructionYear'].notnull())].loc[:,'constructionYear'].values[0]
        # .loc replaces .ix, which has been removed from pandas
        df.loc[(df.loc[:,'streetAddress']==i) & (df.loc[:,'constructionYear'].isnull()),'constructionYear'] = new_constructionYear
        count[0] = count[0] + 1
    except IndexError:
        # No object on this address has a known construction year
        count[1] = count[1] + 1
print(count[0],'addresses matched and new construction year added')
print(count[1],'addresses not matched\n')
constructionYearRange = (int(np.nanmin(df.loc[:,'constructionYear'].values)), int(np.nanmax(df.loc[:,'constructionYear'].values)))
print('Oldest object:\t', constructionYearRange[0])
print('Newest object:\t', constructionYearRange[1])

540 addresses matched and new construction year added
297 addresses not matched

Oldest object:	 1400
Newest object:	 2017
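
The fill-in rule used here ("take the first known construction year on the same address") can be written as two passes over the rows: one to record the first known year per address, one to fill the gaps. The sketch below uses plain dicts rather than the dataframe, and the rows are made-up examples, not Booli data; `fill_year_by_address` is a hypothetical name.

```python
def fill_year_by_address(rows):
    """Fill missing 'constructionYear' from the first known year at the same address.

    `rows` is a list of dicts with 'streetAddress' and 'constructionYear' keys
    (None marks a missing year). Returns (filled, unmatched) row counts.
    """
    first_known = {}
    for row in rows:
        year = row['constructionYear']
        if year is not None:
            # setdefault keeps the first year seen, even if later rows disagree.
            first_known.setdefault(row['streetAddress'], year)

    filled = unmatched = 0
    for row in rows:
        if row['constructionYear'] is None:
            year = first_known.get(row['streetAddress'])
            if year is None:
                unmatched += 1
            else:
                row['constructionYear'] = year
                filled += 1
    return filled, unmatched


# Made-up rows for illustration only.
rows = [
    {'streetAddress': 'Exempelgatan 1', 'constructionYear': 1930},
    {'streetAddress': 'Exempelgatan 1', 'constructionYear': None},
    {'streetAddress': 'Exempelgatan 1', 'constructionYear': 1932},
    {'streetAddress': 'Exempelgatan 2', 'constructionYear': None},
]
filled, unmatched = fill_year_by_address(rows)
```

Note that the counts are per row, not per distinct address, which is also how the loop in the notebook cell counts.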


In [255]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8528 entries, 2220639 to 2012738
Data columns (total 26 columns):
additionalArea         8528 non-null float64
booliId                8528 non-null int64
constructionYear       8231 non-null float64
floor                  7868 non-null float64
isNewConstruction      8528 non-null float64
listPrice              8461 non-null float64
livingArea             8518 non-null float64
location               8528 non-null object
objectType             8528 non-null object
published              8528 non-null object
rent                   8499 non-null float64
rooms                  8523 non-null float64
soldDate               8528 non-null datetime64[ns]
soldPrice              8528 non-null int64
source                 8528 non-null object
url                    8528 non-null object
namedAreas             8528 non-null object
streetAddress          8528 non-null object
streetAddressNumber    8528 non-null object
streetAddressName      8528 non-nu

Much better! Now check the area field.

# namedAreas

In [257]:
print('Nr Areas',len(df.loc[:,'namedAreas'].unique()))
print(df.loc[:,'namedAreas'].unique())

Nr Areas 176
['Södermalm' 'Kungsholmen' 'Södermalm Högalid' 'Vasastan' 'Östermalm'
 'Norra Djurgårdsstaden' 'Norra Djurgården' 'Kungsholmen-Hornsbergs Strand'
 'Årsta' 'Gärdet' 'Essingeöarna' 'Södermalm Norra Hammarbyhamnen'
 'Södermalm-Högalid' 'Ög' 'Hornstull' 'Södermalm Maria Magdalena'
 'Hammarby Sjöstad' 'Birkastan' 'Odenplan' 'Lyceum' 'Nedre Gärdet'
 'Södermalm Sofia' 'Södermalm Hornstull' 'Gamla Stan' 'Norrmalm Vasastan'
 'Katarina' 'Södermalm-Sofo' 'Södermalm-Katarina' 'Enskede-Årsta-Vantör'
 'Södermalm-Maria' 'Högalid' 'Vasastan Birkastan' 'Sofia'
 'Hornsbergs Strand' 'Norrmalm' 'Mälardalen' 'Södermalm Katarina'
 'Södermalm-Sofia' 'Östermalmstorg' 'Stocksholm' 'Allt Omedelbar Närhet'
 'Gårdshus' 'Lilla Essingen' 'Kungsholmen Fridhemsplan' 'Vasastan Odenplan'
 'Kungsholmen Thorildsplan' 'Danviksklippan' 'Centrum' 'Östra Södermalm'
 'Hjorthagen' 'Maria' 'Söder' 'Södermalm Mariatorget' 'Mariaberget'
 'Fridhemsplan' 'Birkastan Vasastan' 'Kungsholmen Fredhäll'
 'Vasastan Östermalm'

#### OMG, this is messed up!
Again, the same address should lie in the same area. Check whether this is true.

In [258]:
# Check which addresses have multiple areas
streetAddress = df.groupby(['streetAddress','namedAreas'])['namedAreas'].count()
streetAddressAreas = streetAddress.unstack().count(axis=1)
ambiguousAddresses = streetAddressAreas[streetAddressAreas>1].index.values
print('Number of addresses with multiple areas:',streetAddressAreas[streetAddressAreas>1].count())
print('Average number of areas for these:\t',round(streetAddressAreas[streetAddressAreas>1].mean(),3))

Number of addresses with multiple areas: 913
Average number of areas for these:	 2.21


That is, the same address is placed in different areas. I guess this is because different brokers label the area differently.

How to deal with this? What is the correct area? Sorting this out needs a lot of manual work.
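
The check for addresses that appear under several area labels can also be sketched in plain Python, which makes the logic explicit: collect the set of labels per address and keep the addresses with more than one. The pairs below are made up (the street names are placeholders), although the label variants are of the kind seen in the data; `ambiguous_addresses` is a hypothetical name.

```python
from collections import defaultdict


def ambiguous_addresses(pairs):
    """Map each address to its set of area labels, keeping only those with more than one."""
    areas_by_address = defaultdict(set)
    for address, area in pairs:
        areas_by_address[address].add(area)
    return {addr: areas for addr, areas in areas_by_address.items() if len(areas) > 1}


# Made-up (streetAddress, namedAreas) pairs for illustration.
pairs = [
    ('Exempelgatan 1', 'Södermalm'),
    ('Exempelgatan 1', 'Södermalm Högalid'),
    ('Exempelgatan 1', 'Södermalm-Högalid'),
    ('Exempelgatan 2', 'Vasastan'),
    ('Exempelgatan 2', 'Vasastan'),
]
ambiguous = ambiguous_addresses(pairs)
# Average number of distinct labels among the ambiguous addresses.
mean_labels = sum(len(areas) for areas in ambiguous.values()) / len(ambiguous)
```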

## Booli area adjustments
- For the known areas I adjust the name manually, e.g. when it is misspelled or spelled differently.
- For the incorrect/unknown area names I check whether the same address appears elsewhere and set the area accordingly.
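
The two rules in the list can be sketched as one small function applied per object. All names here (`clean_area`, `replacements`, `invalid`, `known_by_address`) are hypothetical, and the lookup table is a tiny illustrative sample rather than the full mapping used in this notebook.

```python
def clean_area(area, address, replacements, invalid, known_by_address):
    """Return a cleaned area label for one object.

    1. Known variants are mapped through `replacements`.
    2. Labels in `invalid` are replaced by the area recorded for the same
       address elsewhere (`known_by_address`), or 'Unknown' if there is none.
    """
    area = replacements.get(area, area)
    if area in invalid:
        return known_by_address.get(address, 'Unknown')
    return area


replacements = {'Lärkstan': 'Lärkstaden', 'Gärdet-Östermalm': 'Gärdet'}
invalid = {'Kvm', 'Såld', 'NULL'}
# Hypothetical lookup built from objects on the same address with a valid label.
known_by_address = {'Torsgatan 61': 'Vasastan'}

cleaned = [
    clean_area('Lärkstan', 'Exempelgatan 1', replacements, invalid, known_by_address),
    clean_area('Kvm', 'Torsgatan 61', replacements, invalid, known_by_address),
    clean_area('Såld', 'Exempelgatan 2', replacements, invalid, known_by_address),
    clean_area('Kungsholmen', 'Exempelgatan 3', replacements, invalid, known_by_address),
]
```

Labels that are neither a known variant nor invalid pass through unchanged, so the function is safe to apply to every row.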

In [259]:
# Which areas have very few objects? This can be a small area or a misspelled one.
threshold = 10
# Lists of aggregations give the same two-level columns as the nested dicts,
# which newer pandas versions no longer accept.
agg = {'namedAreas':['count'],
       'soldPriceSqm':['mean'],
       'soldPrice':['mean'],
       'listPrice':['mean']}
namedAreas = df.groupby('namedAreas').agg(agg)
print('list of areas with below',threshold,'objects')
print('Nr Areas: ',len(namedAreas[namedAreas.loc[:,('namedAreas','count')]<threshold].index.values))
print(namedAreas[namedAreas.loc[:,('namedAreas','count')]<threshold].index.values)

list of areas with below 10 objects
Nr Areas:  122
['Allt Omedelbar Närhet' 'Atlas' 'Atlas Vasastan' 'Börja'
 'Centrum-Norrmalm-Vasastan' 'Danviksklippan' 'Djurgården'
 'Ekhagen Djurgården Östermalm' 'Entréplan' 'Fredhäll-Kungsholmen'
 'Fridhemsplan' 'Gröndal Ekensberg' 'Gärdet Östemalm' 'Gärdet Östermalm'
 'Gärdet-Östermalm' 'Gårdshus' 'Hammarby Sjöstad Såld' 'Högalid Södermalm'
 'Högalid-Tanto' 'Karlaplan' 'Katarina-Sofo' 'Katarina/Sofia'
 'Kungsholmen Fredhäll' 'Kungsholmen Kristineberg'
 'Kungsholmen Kungsholms Strand' 'Kungsholmen Lilla Essingen'
 'Kungsholmen Lindhagen' 'Kungsholmen Marieberg' 'Kungsholmen Nedre'
 'Kungsholmen Norr Mälarstrand' 'Kungsholmen Rådhuset'
 'Kungsholmen Sankt Eriksområdet' 'Kungsholmen Stadshagen'
 'Kungsholmen Stora Essingen' 'Kungsholmen Såld' 'Kungsholmen-Fridhemsplan'
 'Kungsholmen-Hornbergs Strand' 'Kungsholmen-Kristinebergs Stra'
 'Kungsholmstorg' 'Kvm' 'Liljeholmen' 'Lyceum' 'Lärkstaden' 'Lärkstan'
 'Mariaberget-Södermalm' 'Marieberg' 'Medborgarp

In [260]:
rep_namedAreas = {'Atlas':'Vasastan',
                  'Atlas Vasastan':'Vasastan',
                  'Centrum-Norrmalm-Vasastan':'Centrum Norrmalm Vasastan',
                  'Birkastan Vasastan':'Vasastan',
                  'Medborgarplatsen-Södermalm':'Medborgarplatsen',
                  'Gärdet Östemalm':'Gärdet',
                  'Gärdet Östermalm':'Gärdet',
                  'Gärdet-Östermalm':'Gärdet',
                  'Maria':'Södermalm Maria',
                  'Nedre Gärdet Såld':'Gärdet',
                  'Katarina':'Södermalm Katarina',
                  'Kungsholmen Stora Essingen':'Stora Essingen',
                  'Kungsholmen Lilla Essingen':'Lilla Essingen',
                  'Lärkstan':'Lärkstaden',
                  'Rödabergen':'Vasastan',
                  'Södermalm-Maria':'Södermalm Maria',
                  'Söddermalm':'Södermalm',
                  'Söderlmalm':'Södermalm',
                  'Södermalm-Katarina-Sofo':'Södermalm Katarina',
                  'Sofia':'Södermalm Sofia',
                  'Hammarby Sjöstad Såld':'Hammarby Sjöstad',
                  'Vasastan Såld':'Vasastan',
                  'Vasastan Atlas':'Vasastan',
                  'Ög':'Gärdet', #Check this
                  'Öv':'Vasastan', #Check this
                  'Östermalm Såld':'Östermalm',
                  'Östermalm - Såld':'Östermalm'
                 }

err_namedAreas = ['test','Börja','Såld','Entréplan','Över','Kvm','Området Finn Gott Om Restauranger',
                  'NULL','Området','Cafeér','Området','Perfekt Naturnära Läge','SoFo','Allt Omedelbar Närhet']

# Do the replacements from dict (.loc replaces .ix, which has been removed from pandas).
for i in rep_namedAreas:
    df.loc[df['namedAreas']==i,'namedAreas'] = rep_namedAreas[i]

# Search for the same address for the invalid names
count = [0,0]
for i in df[df.loc[:,'namedAreas'].isin(err_namedAreas)].loc[:,'streetAddress']:
    try:
        # Take the first value on a matching address, even if the same address has different areas
        old_namedAreas = df[(df.loc[:,'streetAddress']==i) & (df.loc[:,'namedAreas'].isin(err_namedAreas))].loc[:,'namedAreas'].values[0]
        new_namedAreas = df[(df.loc[:,'streetAddress']==i) & (~df.loc[:,'namedAreas'].isin(err_namedAreas))].loc[:,'namedAreas'].values[0]
        print(old_namedAreas,':',i,'->',new_namedAreas)
        df.loc[(df['streetAddress']==i) & (df.loc[:,'namedAreas'].isin(err_namedAreas)),'namedAreas'] = new_namedAreas
        count[0] = count[0] + 1
    except IndexError:
        # No valid area found on this address
        df.loc[(df.loc[:,'streetAddress']==i) & (df.loc[:,'namedAreas'].isin(err_namedAreas)),'namedAreas'] = 'Unknown'
        count[1] = count[1] + 1
print(count[0],'addresses matched')
print(count[1],'addresses not matched\n')

Kvm : Torsgatan 61 -> Vasastan
Såld : Siargatan 17 -> Södermalm Katarina
NULL : Österlånggatan 23 -> Gamla Stan
Kvm : Polhemsgatan 6 -> Kungsholmen
Området : Bjurholmsgatan 37 -> Södermalm
NULL : Tideliusgatan 15 -> Södermalm Katarina
Över : Rörstrandsgatan 38A -> Vasastan
Börja : Rosenlundsgatan 20 -> Södermalm Maria
NULL : Love Almqvists väg 4A -> Kungsholmen
NULL : Fleminggatan 45 -> Kungsholmen
NULL : Rålambsvägen 72 -> Kungsholmen Fredhäll
NULL : Östgötagatan 68 -> Södermalm Katarina
NULL : Svartensgatan 5 -> Södermalm Katarina
Såld : Birkagatan 19 -> Vasastan
Över : Parkgatan 8 -> Kungsholmen
15 addresses matched
8 addresses not matched



Still, the above needs a lot more work in order to be useful. I'll try another method, using GeoPy and the coordinates given.

## GeoPy
https://geopy.readthedocs.io/

In [4]:
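# Initiate geolocator and add column to dataframe
# Newer geopy versions require a user_agent; 'booli-notebook' is a placeholder, use your own application name.
geolocator = Nominatim(user_agent='booli-notebook')
df.loc[:,'geolocation'] = np.nan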

In [118]:
# Loop through the coordinates in the dataframe.
# The number of requests is limited though, so this cell has to be run several times, spread over multiple days.
geoadd = 0
geoexist = 0
for index, row in df.iterrows():
    if pd.isnull(row['geolocation']):
        try:
            location = geolocator.reverse((row['coordinates']['latitude'],row['coordinates']['longitude']))
            df.loc[index,'geolocation'] = [location.raw]
            geoadd = geoadd+1
        except Exception:
            # Stop at the first failure (e.g. request limit reached); rerun the cell later to continue
            print(sys.exc_info())
            break
    else:
        geoexist=geoexist+1
print('geolocations added:',geoadd)
print('geolocations exist:',geoexist)

geolocations added: 873
geolocations exist: 7196
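
Because the lookups are limited, the loop has to be resumable: skip what is already known, pause between requests, and stop cleanly at the first failure. A sketch of that pattern in plain Python follows. The `reverse` argument is a hypothetical callable standing in for the geocoder's reverse lookup, and a fake version that "runs out of requests" is used so the example runs without network access.

```python
import time


def reverse_geocode_missing(coords, cache, reverse, delay=1.0):
    """Look up every key in `coords` that is not yet in `cache`.

    `coords` maps an object id to a (latitude, longitude) tuple, `cache` maps
    ids to earlier results, and `reverse` is a hypothetical lookup callable.
    Stops at the first failure so that a later run can resume from the cache.
    """
    added = 0
    for key, position in coords.items():
        if key in cache:
            continue
        try:
            cache[key] = reverse(position)
        except Exception as error:
            print('stopped at', key, ':', error)
            break
        added += 1
        time.sleep(delay)  # pause between requests to stay within the service limits
    return added


coords = {1: (59.35, 18.05), 2: (59.31, 18.02), 3: (59.33, 18.03), 4: (59.34, 18.04)}
cache = {}
budget = {'left': 2}


def flaky_reverse(position):
    # Fake lookup that fails once the request budget is used up.
    if budget['left'] == 0:
        raise RuntimeError('rate limit reached')
    budget['left'] -= 1
    return {'lat': position[0], 'lon': position[1]}


first_run = reverse_geocode_missing(coords, cache, flaky_reverse, delay=0)
budget['left'] = 10  # simulate coming back another day
second_run = reverse_geocode_missing(coords, cache, flaky_reverse, delay=0)
```

geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that can wrap the geocoder call to add delays and retries; whether it would have been enough for the limits hit here is not something this notebook tested.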


In [72]:
# What info did we obtain from the coordinates? Look at the last one obtained.
location.raw

{'address': {'city': 'Sthlm',
  'city_district': 'Norrmalms stadsdelsområde',
  'country': 'Sverige',
  'country_code': 'se',
  'county': 'Stockholm',
  'house_number': '125',
  'neighbourhood': 'Sibirien',
  'postcode': '113 54',
  'road': 'Birger Jarlsgatan',
  'state': 'Stockholms län',
  'state_district': 'Landskapet Uppland',
  'suburb': 'Vasastan'},
 'boundingbox': ['59.3504593', '59.3506593', '18.0574463', '18.0576463'],
 'display_name': '125, Birger Jarlsgatan, Sibirien, Vasastan, Norrmalms stadsdelsområde, Sthlm, Stockholm, Landskapet Uppland, Stockholms län, Svealand, 113 54, Sverige',
 'lat': '59.3505593',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '18.0575463',
 'osm_id': '1611096502',
 'osm_type': 'node',
 'place_id': '16767250'}

Extract suburb, neighbourhood and city_district.

In [108]:
df.loc[:,'suburb'] = np.nan
df.loc[:,'neighbourhood'] = np.nan
df.loc[:,'city_district'] = np.nan

In [119]:
geoadded = 0
geomissing = 0
for index, row in df.iterrows():
    if pd.notnull(row['geolocation']):
        # Geolocation exists: copy each address field that is present
        try:
            df.loc[index,'suburb'] = df.loc[index, 'geolocation'][0]['address']['suburb']
        except (KeyError, IndexError, TypeError):
            pass
        try:
            df.loc[index,'neighbourhood'] = df.loc[index, 'geolocation'][0]['address']['neighbourhood']
        except (KeyError, IndexError, TypeError):
            pass
        try:
            df.loc[index,'city_district'] = df.loc[index, 'geolocation'][0]['address']['city_district']
        except (KeyError, IndexError, TypeError):
            pass
        geoadded = geoadded+1
    else:
        geomissing = geomissing+1
print('geopy existing:\t',geoadded)
print('geopy geomissing:\t',geomissing)

geopy existing:	 8069
geopy geomissing:	 0


In [216]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8069 entries, 2210130 to 2013014
Data columns (total 32 columns):
additionalArea         8069 non-null float64
booliId                8069 non-null int64
constructionYear       7796 non-null float64
floor                  7462 non-null float64
isNewConstruction      8069 non-null float64
listPrice              8007 non-null float64
livingArea             8059 non-null float64
location               8069 non-null object
objectType             8069 non-null object
published              8069 non-null object
rent                   8041 non-null float64
rooms                  8064 non-null float64
soldDate               8069 non-null datetime64[ns]
soldPrice              8069 non-null int64
source                 8069 non-null object
url                    8069 non-null object
namedAreas             8069 non-null object
streetAddress          8069 non-null object
streetAddressNumber    8069 non-null object
streetAddressName      8069 non-nu

city_district and suburb seem to be the fields we want to look at; not many objects contain the neighbourhood field. Check these fields and, while we're at it, add some price info.

In [224]:
# Suburb
# (lists of aggregations, since nested dicts are no longer accepted by newer pandas versions)
agg = {'suburb':['count'],
       'soldPriceSqm':['mean'],
       'soldPrice':['mean'],
       'listPrice':['mean']}
display(
    df.groupby('suburb').agg(agg).
    sort_values(by=('soldPriceSqm','mean'),ascending=False).
    style.background_gradient(cmap='RdYlGn',high=0.2, low=0.2).
    highlight_null('white')
    )

| suburb | soldPriceSqm (mean) | soldPrice (mean) | suburb (count) | listPrice (mean) |
|---|---|---|---|---|
| Djurgården | 141509.0 | 15000000.0 | 1 | 10500000.0 |
| Östermalm | 104512.0 | 7543120.0 | 623 | 6983350.0 |
| Gamla stan | 102503.0 | 5884520.0 | 31 | 5438550.0 |
| Vasastaden | 100143.0 | 5559280.0 | 21 | 4880710.0 |
| Vasastan | 94437.1 | 5570350.0 | 1578 | 4986340.0 |
| Norrmalm | 91817.2 | 5917380.0 | 191 | 5460390.0 |
| Kungsholmen | 91493.2 | 4874660.0 | 938 | 4342960.0 |
| Ladugårdsgärdet | 90770.1 | 4963280.0 | 341 | 4450160.0 |
| Södermalm | 87699.0 | 4615400.0 | 2097 | 3982820.0 |
| Reimersholme | 86108.8 | 5192500.0 | 50 | 4410500.0 |


In [225]:
# city_district
# (lists of aggregations, since nested dicts are no longer accepted by newer pandas versions)
agg = {'city_district':['count'],
       'soldPriceSqm':['mean'],
       'soldPrice':['mean'],
       'listPrice':['mean']}
display(
    df.groupby('city_district').agg(agg).
    sort_values(by=('soldPriceSqm','mean'),ascending=False).
    style.background_gradient(cmap='RdYlGn',high=0.2, low=0.2).
    highlight_null('white')
    )

| city_district | soldPriceSqm (mean) | soldPrice (mean) | city_district (count) | listPrice (mean) |
|---|---|---|---|---|
| Norrmalms stadsdelsområde | 94224.5 | 5607250.0 | 1790 | 5035750.0 |
| Östermalms stadsdelsområde | 93911.4 | 6042060.0 | 1332 | 5550900.0 |
| Kungsholmens stadsdelsområde | 85952.8 | 4526840.0 | 1880 | 4053690.0 |
| Södermalms stadsdelsområde | 85490.9 | 4710980.0 | 2537 | 4109060.0 |
| Enskede-Årsta-Vantörs stadsdelsområde | 56788.6 | 3229660.0 | 530 | 2869280.0 |


# Broker

### Add info on price difference, soldPrice-listPrice (we call it lockpris)

In [262]:
df.loc[:,'changedPrice'] = df.loc[:,'soldPrice']-df.loc[:,'listPrice']
df.loc[:,'changedPriceSqm'] = df.loc[:,'changedPrice']/df.loc[:,'livingArea']
print('Average change from list price to sold price:\t',int(df.loc[:,'changedPrice'].mean()),'kr')
print('Average change from list price to sold price:\t',round((df.loc[:,'changedPrice']/df.loc[:,'listPrice']).mean()*100,2),'%')
print('Max (+) change from list price to sold price:\t',round((df.loc[:,'changedPrice']/df.loc[:,'listPrice']).max()*100,2),'%')
print('Max (-) change from list price to sold price:\t',round((df.loc[:,'changedPrice']/df.loc[:,'listPrice']).min()*100,2),'%')

Average change from list price to sold price:	 533645 kr
Average change from list price to sold price:	 14.26 %
Max (+) change from list price to sold price:	 90.7 %
Max (-) change from list price to sold price:	 -24.04 %
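
The price change figures are simple arithmetic, but one detail is easy to miss: the average percentage printed here is the mean of each object's own percentage, which is not the same as the percentage computed from total prices. The sketch below shows both. The first pair of prices is taken from the Estrad Fastighetsmäklare row of the broker table in this notebook; the second set of pairs is made up for illustration, and `price_change` is a hypothetical name.

```python
def price_change(sold_price, list_price):
    """Return (absolute change, change in percent of the list price)."""
    change = sold_price - list_price
    return change, round(change / list_price * 100, 2)


# Sold for 4,015,000 kr, listed at 3,250,000 kr.
change, pct = price_change(4015000, 3250000)

# Mean of per-object percentages vs. percentage of the summed prices (made-up pairs).
pairs = [(1100000, 1000000), (10500000, 10000000)]
mean_of_pcts = sum(price_change(sold, listed)[1] for sold, listed in pairs) / len(pairs)
pct_of_totals = price_change(sum(sold for sold, _ in pairs),
                             sum(listed for _, listed in pairs))[1]
```

In the made-up example the cheap object rose 10 % and the expensive one 5 %, so the mean of percentages is 7.5 %, while the change measured on the summed prices is lower because the expensive object dominates the totals.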


In [264]:
# Broker
# (lists of aggregations, since nested dicts are no longer accepted by newer pandas versions)
agg = {'broker':['count'],
       'soldPriceSqm':['mean'],
       'soldPrice':['mean'],
       'listPrice':['mean'],
       'changedPrice':['mean'],
       'changedPriceSqm':['mean']}
display(
    df.groupby('broker').agg(agg).
    sort_values(by=('soldPriceSqm','mean'),ascending=False).
    style.background_gradient(cmap='RdYlGn',high=0.2, low=0.2).
    highlight_null('white')
    )

| broker | soldPriceSqm (mean) | soldPrice (mean) | changedPriceSqm (mean) | broker (count) | changedPrice (mean) | listPrice (mean) |
|---|---|---|---|---|---|---|
| Norling & Partners AB | 115171.0 | 5053750.0 | 2478.63 | 4 | 72500.0 | 4981250.0 |
| Fredegårds Fastighetsbyrå AB | 111617.0 | 7795560.0 | 11543.2 | 9 | 662222.0 | 7133330.0 |
| Estrad Fastighetsmäklare | 111528.0 | 4015000.0 | 21250.0 | 1 | 765000.0 | 3250000.0 |
| Skeppsholmen | 105915.0 | 9355400.0 | 7556.2 | 50 | 673673.0 | 8491020.0 |
| Lagerlings | 105645.0 | 13383600.0 | 6094.58 | 67 | 707313.0 | 12676300.0 |
| Siv Kraft Mäklarbyrå AB | 105493.0 | 6791000.0 | 5507.8 | 5 | 107000.0 | 6684000.0 |
| Patrik Hiltunen | 101000.0 | 3131000.0 | 17290.3 | 1 | 536000.0 | 2595000.0 |
| Behrer & Partners | 100253.0 | 6438400.0 | 11616.5 | 106 | 649245.0 | 5789150.0 |
| Berggren & Co Fastighetsmäkleri | 100000.0 | 2800000.0 | 18035.7 | 1 | 505000.0 | 2295000.0 |
| Per Jansson Fastighetsförmedling | 99471.4 | 10274800.0 | 4370.5 | 107 | 389953.0 | 9927080.0 |


**This looks better, phew. We will look closer at this later.**

In [261]:
# Store the dataframe so it can be loaded in other notebook.
#%store -r df
%store df
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8069 entries, 2210130 to 2013014
Data columns (total 32 columns):
additionalArea         8069 non-null float64
booliId                8069 non-null int64
constructionYear       7796 non-null float64
floor                  7462 non-null float64
isNewConstruction      8069 non-null float64
listPrice              8007 non-null float64
livingArea             8059 non-null float64
location               8069 non-null object
objectType             8069 non-null object
published              8069 non-null object
rent                   8041 non-null float64
rooms                  8064 non-null float64
soldDate               8069 non-null datetime64[ns]
soldPrice              8069 non-null int64
source                 8069 non-null object
url                    8069 non-null object
namedAreas             8069 non-null object
streetAddress          8069 non-null object
streetAddressNumber    8069 non-null object
streetAddressName      8069 non-nu

**Next notebook will focus on visualization!**