In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import math

In [43]:
data = pd.read_csv("sthlm_raw_clean.csv")

In [44]:
data

Unnamed: 0,adress,omrade,kvm,rum,maklare,avgift,slutpris,datum,prisförändring,gata_id_lst,gata_lst,stockholm_lst
0,"Gamla Brogatan 25, 2tr","Vasastan - City/Norrmalm,",114.0,3.5,Fastighetsbyrån Stockholm - Vasastan,6769.0,7600000,2018-04-13,-5,475079,Gamla Brogatan,Stockholms kommun
1,"Gamla Brogatan 25, 2 tr","Vasastan- City/ Norrmalm,",71.0,2.0,Mäklarhuset Stockholm Innerstan,4696.0,5050000,2016-06-23,4,475079,Gamla Brogatan,Stockholms kommun
2,Gamla Brogatan 25,"Vasastan- City/ Norrmalm,",102.0,4.0,Mäklarhuset Stockholm Innerstan,6519.0,6950000,2016-04-29,1,475079,Gamla Brogatan,Stockholms kommun
3,"Gamla Brogatan 25, 2tr","Vasastan - City/Norrmalm,",107.0,4.0,Fastighetsbyrån Stockholm - Vasastan,6713.0,7150000,2015-11-26,2,475079,Gamla Brogatan,Stockholms kommun
4,"Drottninggatan 114 A, 3 tr","Vasastan - Norrmalm,",90.0,3.0,Bostadsrättsspecialisten,3822.0,8900000,2020-08-13,0,475084,Drottninggatan,Stockholms kommun
...,...,...,...,...,...,...,...,...,...,...,...,...
57938,"Lindevägen 56, 3 tr","Enskede Gård,",74.5,3.0,Svensk Fastighetsförmedling,5002.0,2740000,2014-04-10,19,476350,Lindevägen,Stockholms kommun
57939,Lindevägen 56,"Enskede Gård,",63.0,2.0,Svensk Fastighetsförmedling,3960.0,1905000,2014-02-28,12,476350,Lindevägen,Stockholms kommun
57940,"Lindevägen 44, 2tr","Enskede Gård,",107.0,4.0,Fastighetsbyrån Enskede,6977.0,3720000,2013-08-22,6,476350,Lindevägen,Stockholms kommun
57941,Lindevägen 50,"Enskede Gård,",61.5,2.0,Mäklarhuset Enskede,4221.0,2100000,2013-03-07,11,476350,Lindevägen,Stockholms kommun


# Intro

In this notebook I derive new columns from the existing data.

To be clear, no new data is added: every new column is computed from columns already present.

Columns to add:

1) kr_kvm (price per square metre)

2) år (year)

3) månad (month)

4) år_månad (year and month)

# Add kr_kvm

In [45]:
# add kr_kvm: final price (slutpris) divided by square metres (kvm)
data["kr_kvm"] = data.slutpris / data.kvm

In [46]:
# convert from float to int (astype(int) truncates the decimals rather than rounding)
data.kr_kvm = data.kr_kvm.astype(int)
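
The difference between truncating and rounding is small here but worth knowing about. The sketch below uses a hypothetical two-row frame `toy`, built from the price and area of rows 0 and 1 in the table above, to compare `astype(int)` with `round().astype(int)`. The variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame: price and area copied from rows 0 and 1 of the table above.
toy = pd.DataFrame({"slutpris": [7_600_000, 5_050_000], "kvm": [114.0, 71.0]})

ratio = toy["slutpris"] / toy["kvm"]   # roughly 66666.67 and 71126.76

truncated = ratio.astype(int)          # drops the decimals
rounded = ratio.round().astype(int)    # rounds to the nearest integer first

print(pd.DataFrame({"truncated": truncated, "rounded": rounded}))
```

The truncated values match the kr_kvm column produced in this notebook; rounding would shift each of these two rows up by one krona per square metre, which does not matter for the analysis but explains why 7600000 / 114 shows up as 66666 rather than 66667.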

## Inspecting kr_kvm values

Taking the opportunity to check whether any price-per-square-metre values look unreasonable.

In [47]:
data.kr_kvm.describe()

count     57943.000000
mean      84225.726421
std       18528.353741
min       25714.000000
25%       71346.000000
50%       83561.000000
75%       96052.000000
max      220588.000000
Name: kr_kvm, dtype: float64

In [48]:
# Looks reasonable
data.kr_kvm.nsmallest()

19934    25714
42458    25806
50297    26923
51437    28525
19669    29166
Name: kr_kvm, dtype: int32

In [49]:
# Looks reasonable
data.kr_kvm.nlargest()

31407    220588
22780    220400
31405    215053
29827    207947
11193    202857
Name: kr_kvm, dtype: int32
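
Eyeballing the extremes works well at this scale. As a complement, here is a minimal sketch of a rule-based check using the common 1.5 × IQR fences. The helper `flag_outliers` and the toy values are assumptions made for illustration; they are not part of the notebook, and the rule is only one of several possible choices.

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy values, not taken from the dataset.
kr_kvm = pd.Series([71_000, 83_000, 84_000, 90_000, 96_000, 400_000])
mask = flag_outliers(kr_kvm)

print(kr_kvm[mask])  # only the 400 000 entry falls outside the fences
```

With the quartiles printed by `describe()` above (71346 and 96052), the fences would land at roughly 34 287 and 133 111, so this rule would flag the smallest and largest values listed. A flag is a prompt to look at the rows, not proof that they are wrong.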

# Add år

In [50]:
# add a year column for later slicing
data["år"] = pd.DatetimeIndex(data["datum"]).year

# Add månad

In [51]:
data["månad"] = pd.DatetimeIndex(data["datum"]).month

# Add år_månad

In [52]:
data["år_månad"] = pd.to_datetime(data.år.astype(str) + "-" + data.månad.astype(str))
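
The cell above builds the month stamp by joining year and month as strings and parsing the result. A possible alternative, sketched below on a hypothetical three-value series, goes through a monthly period instead, which avoids the string round trip. This is an illustration under those assumptions, not a change to the notebook.

```python
import pandas as pd

# Toy dates in the same format as the "datum" column.
datum = pd.Series(["2018-04-13", "2016-06-23", "2015-11-26"])
dates = pd.to_datetime(datum)

# Snap each date to its month, then convert back to a timestamp
# (the first day of that month).
via_period = dates.dt.to_period("M").dt.to_timestamp()

print(via_period)
```

Both routes give the first day of the month as a datetime value; the period route also works directly on the date column without first creating år and månad.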

# Inspection

In [53]:
# looks good 
data.head()

Unnamed: 0,adress,omrade,kvm,rum,maklare,avgift,slutpris,datum,prisförändring,gata_id_lst,gata_lst,stockholm_lst,kr_kvm,år,månad,år_månad
0,"Gamla Brogatan 25, 2tr","Vasastan - City/Norrmalm,",114.0,3.5,Fastighetsbyrån Stockholm - Vasastan,6769.0,7600000,2018-04-13,-5,475079,Gamla Brogatan,Stockholms kommun,66666,2018,4,2018-04-01
1,"Gamla Brogatan 25, 2 tr","Vasastan- City/ Norrmalm,",71.0,2.0,Mäklarhuset Stockholm Innerstan,4696.0,5050000,2016-06-23,4,475079,Gamla Brogatan,Stockholms kommun,71126,2016,6,2016-06-01
2,Gamla Brogatan 25,"Vasastan- City/ Norrmalm,",102.0,4.0,Mäklarhuset Stockholm Innerstan,6519.0,6950000,2016-04-29,1,475079,Gamla Brogatan,Stockholms kommun,68137,2016,4,2016-04-01
3,"Gamla Brogatan 25, 2tr","Vasastan - City/Norrmalm,",107.0,4.0,Fastighetsbyrån Stockholm - Vasastan,6713.0,7150000,2015-11-26,2,475079,Gamla Brogatan,Stockholms kommun,66822,2015,11,2015-11-01
4,"Drottninggatan 114 A, 3 tr","Vasastan - Norrmalm,",90.0,3.0,Bostadsrättsspecialisten,3822.0,8900000,2020-08-13,0,475084,Drottninggatan,Stockholms kommun,98888,2020,8,2020-08-01


In [54]:
# looks good
# note, though, that "datum" loses its datetime format when the csv is read: it comes back as object
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57943 entries, 0 to 57942
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   adress          57943 non-null  object        
 1   omrade          57218 non-null  object        
 2   kvm             57943 non-null  float64       
 3   rum             57943 non-null  float64       
 4   maklare         57943 non-null  object        
 5   avgift          57731 non-null  float64       
 6   slutpris        57943 non-null  int64         
 7   datum           57943 non-null  object        
 8   prisförändring  57943 non-null  int64         
 9   gata_id_lst     57943 non-null  int64         
 10  gata_lst        56688 non-null  object        
 11  stockholm_lst   57943 non-null  object        
 12  kr_kvm          57943 non-null  int32         
 13  år              57943 non-null  int64         
 14  månad           57943 non-null  int64         
 15  år_månad        57943 non-null  datetime64[ns]
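
The "datum" column shows up as object because `read_csv` leaves date-like text as strings unless asked to parse it. A minimal sketch of the `parse_dates` option follows, using a small in-memory csv (`csv_text`, a stand-in for the real file) so that it runs on its own.

```python
import io
import pandas as pd

# Two rows in the shape of the source file; only the columns needed here.
csv_text = "slutpris,datum\n7600000,2018-04-13\n5050000,2016-06-23\n"

# Default read: the dates stay as text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the named column is converted to datetime on load.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["datum"])

print(plain["datum"].dtype)
print(parsed["datum"].dtype)
```

Applied to this notebook, that would mean passing `parse_dates=["datum"]` in the `read_csv` call at the top, after which `data["datum"].dt.year` and `.dt.month` could be used directly.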

# Write to csv

There is one more round of cleaning left before I move on to analysis.

I'm therefore still naming the csv file so that I can keep track of the changes made so far.

In [56]:
data.to_csv("sthlm_raw_clean_added_columns.csv")
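
One thing to keep in mind for the next notebook: by default `to_csv` also writes the index, which comes back as an extra "Unnamed: 0" column when the file is read again without `index_col`. The sketch below shows the round trip on a hypothetical one-row frame, with in-memory buffers standing in for files. Whether the index is wanted in the output is a judgment call, so treat this as an option rather than a correction.

```python
import io
import pandas as pd

df = pd.DataFrame({"slutpris": [7_600_000], "datum": pd.to_datetime(["2018-04-13"])})

# Default: the RangeIndex is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False leaves the index out; parse_dates restores the datetime column.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index, parse_dates=["datum"])

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```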