In [1]:
import pandas as pd
import numpy as np
import re
import math

In [27]:
data = pd.read_csv("sthlm_raw_clean_added_columns.csv")

# Deciding on start-date for data-set

I need to select an appropriate start-date for my data-set. In the analysis I will compare different time-periods, so each period needs a number of observations that is either similar to the others or at least large enough to compare.

In [28]:
# read_csv loads datum as plain strings (a csv does not store dtypes), so here I'm converting datum to datetime-format
data.datum = pd.to_datetime(data.datum)
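
As an aside, the conversion can also be done while reading the file, via the `parse_dates` argument of `pd.read_csv`. The cell below is only a sketch on a made-up two-row sample (`sample_csv` and `sample` are hypothetical names, not part of my data-set):

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real csv file
sample_csv = io.StringIO("datum,år\n2013-01-05,2013\n2020-09-12,2020\n")

# parse_dates converts the listed columns to datetime while reading
sample = pd.read_csv(sample_csv, parse_dates=["datum"])
print(sample.dtypes)
```

With this the separate `pd.to_datetime` step would not be needed, assuming the dates in the file are in a format pandas can parse.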

In [29]:
# Looking at start & end date
data.datum.describe()

count                   57943
unique                   2726
top       2015-05-13 00:00:00
freq                       89
first     2011-02-09 00:00:00
last      2020-09-12 00:00:00
Name: datum, dtype: object
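
The layout of `describe()` for a datetime column can differ between pandas versions, so a version-independent way to get just the start and end date is `min()` and `max()`. A small sketch on a made-up series (the three dates are only illustrative, `sample` is a hypothetical name):

```python
import pandas as pd

# Hypothetical sample with three dates, not the real data-set
sample = pd.Series(
    pd.to_datetime(["2011-02-09", "2015-05-13", "2020-09-12"]),
    name="datum",
)

# Earliest and latest date in the column
first, last = sample.min(), sample.max()
print(first.date(), last.date())
```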

In [30]:
# Looking at the number of observations for each year
# 2011 and 2012 have only 2 and 5 observations...
# ...so there is no point in keeping these rows, they would just disturb the analysis
data.år.value_counts()

2019    8830
2018    7867
2017    7809
2015    7728
2016    7717
2014    7116
2020    6810
2013    4059
2012       5
2011       2
Name: år, dtype: int64

In [31]:
# List with the index labels of the rows I want to drop (year 2011 or 2012)
lst_index_2011_2012 = data.loc[data.år.isin([2011, 2012])].index.tolist()

In [32]:
# Dropping rows with year 2011 or 2012
data.drop(lst_index_2011_2012, inplace = True)
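
The same result could probably also be reached without building an index list, by keeping only the rows from 2013 onwards with a boolean mask. A minimal sketch on a made-up frame (`sample` and `filtered` are hypothetical names):

```python
import pandas as pd

# Hypothetical sample with a few years, standing in for the real data-set
sample = pd.DataFrame({"år": [2011, 2012, 2013, 2014, 2012, 2020]})

# Keep only rows where the year is 2013 or later
filtered = sample[sample["år"] >= 2013]
print(filtered["år"].tolist())  # [2013, 2014, 2020]
```

The mask returns a new frame instead of changing `sample` in place, so the original rows stay available for comparison.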

In [33]:
# Many observations for each year
# Significantly fewer observations for 2013, but still enough I think
data.år.value_counts()

2019    8830
2018    7867
2017    7809
2015    7728
2016    7717
2014    7116
2020    6810
2013    4059
Name: år, dtype: int64

In [34]:
# Just making sure each month in 2013 has a decent number of observations
data.loc[data.år == 2013].månad.value_counts()

10    585
8     518
9     516
11    467
5     351
6     322
4     312
3     277
12    206
7     200
2     153
1     152
Name: månad, dtype: int64

# Decision start-date: 2013-01-01

I only dropped 7 rows (2 for year 2011 and 5 for year 2012), which reduced the date-span of my data-set from 2011-2020 to 2013-2020.

There were significantly fewer observations for the year 2013, but the number was still quite high (4059) and each month had a decent number of observations (between 152 and 585). Therefore I'm making the judgement that the number of observations for each period (year and/or month) will not skew my analysis, at least not to a large extent.
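
The month-by-month check above could be extended to every year at once by counting rows per year and month and listing the combinations that fall below some minimum. This is only a sketch on made-up data: `sample`, `counts`, `sparse` and the threshold `min_obs` are hypothetical, and the threshold is not a value I have used in the analysis.

```python
import pandas as pd

# Hypothetical sample, standing in for the år and månad columns of the data-set
sample = pd.DataFrame({
    "år":    [2013, 2013, 2013, 2014, 2014, 2014],
    "månad": [1, 1, 2, 1, 2, 2],
})

# Number of observations for each (year, month) combination
counts = sample.groupby(["år", "månad"]).size()

# Hypothetical threshold: periods with fewer rows than this are flagged
min_obs = 2
sparse = counts[counts < min_obs]
print(sparse)
```

An empty `sparse` would mean that every year-month period reaches the chosen minimum.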

# Data-set is ready for analysis: writing to csv

My data-set is now finished and ready for analysis.

I'm marking the file name with v1 (version 1).

In [40]:
# Dropping extra index
data.drop(columns = "Unnamed: 0", inplace = True)

In [43]:
# Writing to csv, specifying index = False so that the index is not written as an extra column
data.to_csv("sthlm_v1.csv", index = False)
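
To illustrate why `index = False` matters here: writing with the default settings stores the index as an unnamed column, which shows up as "Unnamed: 0" when the file is read back. A small sketch using in-memory buffers instead of real files (`sample` and the buffer names are hypothetical):

```python
import io
import pandas as pd

# Hypothetical two-row sample, not the real data-set
sample = pd.DataFrame({"datum": ["2013-01-05", "2020-09-12"], "år": [2013, 2020]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the actual columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'datum', 'år']
print(cols_without)  # ['datum', 'år']
```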