### Clean dataset on stolen bikes.


In [61]:
# standard import of pandas
import pandas as pd

## Loading the first dataset
We'll use data on bicycle thefts in Berlin, recorded at the granular level of the city's planning areas, the so-called "LOR" ("Lebensweltlich orientierte Räume"). We will come across them again later!  
This data is collected by the Berlin police and provided by Berlin Open Data.  

### The goal for later: To be able to identify areas in Berlin with the most bike thefts or the highest theft amounts  
### The goal for today: clean this dataset to prepare it for our data analysis

First things first: We make the data accessible just by loading the .csv-file into a dataframe and get an overview.

[Website of the dataset - daten.berlin.de](https://daten.berlin.de/datensaetze/fahrraddiebstahl-berlin)

- License:
    - Creative Commons Attribution (CC BY)
- Geographical granularity:
    - Berlin
- Publisher:
    - Polizei Berlin LKA St 14
- Email:
    - onlineredaktion@polizei.berlin.de

In [62]:
# proper encoding is necessary here!
thefts_df_raw = pd.read_csv('../../data/Fahrraddiebstahl.csv', encoding='latin-1')
# make column names lowercase
thefts_df_raw.columns = thefts_df_raw.columns.str.lower()
thefts_df_raw.head(2)

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt,erfassungsgrund
0,14.09.2020,10.09.2020,10,10.09.2020,12,3400723,706,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
1,29.09.2020,09.09.2020,16,10.09.2020,7,9200716,220,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern


In [63]:
# what's the shape, the observations, datatypes and null-counts?
thefts_df_raw.shape

(39407, 11)

In [64]:
thefts_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39407 entries, 0 to 39406
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   angelegt_am            39407 non-null  object
 1   tatzeit_anfang_datum   39407 non-null  object
 2   tatzeit_anfang_stunde  39407 non-null  int64 
 3   tatzeit_ende_datum     39407 non-null  object
 4   tatzeit_ende_stunde    39407 non-null  int64 
 5   lor                    39407 non-null  int64 
 6   schadenshoehe          39407 non-null  int64 
 7   versuch                39407 non-null  object
 8   art_des_fahrrads       39407 non-null  object
 9   delikt                 39407 non-null  object
 10  erfassungsgrund        39407 non-null  object
dtypes: int64(4), object(7)
memory usage: 3.3+ MB


In [65]:
thefts_df_raw.describe() # by default, describe() summarizes only the numeric columns

Unnamed: 0,tatzeit_anfang_stunde,tatzeit_ende_stunde,lor,schadenshoehe
count,39407.0,39407.0,39407.0,39407.0
mean,14.525922,13.276626,5534755.0,825.206055
std,5.344165,5.217552,3336003.0,809.575186
min,0.0,0.0,1100101.0,0.0
25%,10.0,9.0,2500831.0,372.0
50%,16.0,14.0,4501042.0,599.0
75%,19.0,17.0,8100314.0,999.0
max,23.0,23.0,12601240.0,9999.0


In [66]:
thefts_df_raw.describe(include='O') # include='O' (the letter O, short for object) summarizes only the object (string) columns

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_ende_datum,versuch,art_des_fahrrads,delikt,erfassungsgrund
count,39407,39407,39407,39407,39407,39407,39407
unique,698,698,698,3,8,2,4
top,14.09.2020,15.09.2020,14.09.2020,Nein,Herrenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
freq,146,124,127,39230,18007,37517,35274
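
The `describe()` calls in this part of the notebook differ only in which columns they summarize. A minimal sketch on a made-up two-column frame (the column names mirror the dataset, the values are invented) shows the effect of the `include` argument:

```python
import pandas as pd

# toy frame with one numeric and one text column (values are invented)
toy = pd.DataFrame({
    "schadenshoehe": [100, 250, 250],
    "versuch": pd.Series(["Nein", "Nein", "Ja"], dtype=object),
})

numeric_summary = toy.describe()             # default: numeric columns only
object_summary = toy.describe(include="O")   # object (string) columns only
full_summary = toy.describe(include="all")   # every column, NaN where a statistic does not apply

print(list(numeric_summary.columns))
print(list(object_summary.columns))
print(list(full_summary.columns))
```

For text columns the summary consists of `count`, `unique`, `top` and `freq` instead of mean and quantiles, which is why the combined table contains so many empty cells.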


In [67]:
thefts_df_raw.describe(include='all')

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt,erfassungsgrund
count,39407,39407,39407.0,39407,39407.0,39407.0,39407.0,39407,39407,39407,39407
unique,698,698,,698,,,,3,8,2,4
top,14.09.2020,15.09.2020,,14.09.2020,,,,Nein,Herrenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
freq,146,124,,127,,,,39230,18007,37517,35274
mean,,,14.525922,,13.276626,5534755.0,825.206055,,,,
std,,,5.344165,,5.217552,3336003.0,809.575186,,,,
min,,,0.0,,0.0,1100101.0,0.0,,,,
25%,,,10.0,,9.0,2500831.0,372.0,,,,
50%,,,16.0,,14.0,4501042.0,599.0,,,,
75%,,,19.0,,17.0,8100314.0,999.0,,,,
max,,,23.0,,23.0,12601240.0,9999.0,,,,


In [68]:
thefts_df_raw.dtypes

angelegt_am              object
tatzeit_anfang_datum     object
tatzeit_anfang_stunde     int64
tatzeit_ende_datum       object
tatzeit_ende_stunde       int64
lor                       int64
schadenshoehe             int64
versuch                  object
art_des_fahrrads         object
delikt                   object
erfassungsgrund          object
dtype: object

In [69]:
thefts_df_raw.isnull().sum()

angelegt_am              0
tatzeit_anfang_datum     0
tatzeit_anfang_stunde    0
tatzeit_ende_datum       0
tatzeit_ende_stunde      0
lor                      0
schadenshoehe            0
versuch                  0
art_des_fahrrads         0
delikt                   0
erfassungsgrund          0
dtype: int64

In [70]:
thefts_df_raw.isna().sum()

angelegt_am              0
tatzeit_anfang_datum     0
tatzeit_anfang_stunde    0
tatzeit_ende_datum       0
tatzeit_ende_stunde      0
lor                      0
schadenshoehe            0
versuch                  0
art_des_fahrrads         0
delikt                   0
erfassungsgrund          0
dtype: int64
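
`isnull()` and `isna()` are two names for the same check, which is why both cells print identical counts. A small sketch on an invented frame with one missing value per column:

```python
import numpy as np
import pandas as pd

# toy frame with one missing value in each column (values are invented)
toy = pd.DataFrame({
    "lor": [1100101, np.nan, 2500831],
    "versuch": pd.Series(["Nein", None, "Ja"], dtype=object),
})

na_counts = toy.isna().sum()
null_counts = toy.isnull().sum()

# both methods flag exactly the same cells
same_result = toy.isna().equals(toy.isnull())

print(na_counts)
print(same_result)
```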

Let's think about cleaning our data:

- duplicates: inspect, then decide whether to drop them
- columns 'angelegt_am' and 'erfassungsgrund': drop, they are irrelevant to us
- column 'versuch': inspect!  
- column 'tatzeit_anfang_datum': change date string to datetime format  
- column 'tatzeit_ende_datum': change date string to datetime format

In [71]:
# inspect duplicates
duplicates = thefts_df_raw[thefts_df_raw.duplicated(keep=False)]
# keep=False   => every row that has a duplicate is marked True
# keep='first' => the first occurrence is marked False, all later copies True
# keep='last'  => the last occurrence is marked False, all earlier copies True

In [72]:
duplicates

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt,erfassungsgrund
433,10.06.2020,09.06.2020,20,10.06.2020,6,4200309,602,Nein,Herrenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
720,17.06.2020,17.06.2020,7,17.06.2020,8,2100106,100,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
722,17.06.2020,17.06.2020,7,17.06.2020,8,2100106,100,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
909,17.08.2021,16.08.2021,18,17.08.2021,7,1200519,1000,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
967,17.08.2021,16.08.2021,18,17.08.2021,7,1200519,1000,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
...,...,...,...,...,...,...,...,...,...,...,...
37962,01.05.2020,30.04.2020,16,01.05.2020,12,7601443,333,Nein,Kinderfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
38505,21.02.2021,21.02.2021,0,21.02.2021,6,4300624,395,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
38507,21.02.2021,21.02.2021,0,21.02.2021,6,4300624,395,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
38941,16.07.2021,15.07.2021,22,16.07.2021,17,2200212,600,Nein,Fahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
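
To see what the `keep` argument of `duplicated()` does, here is a minimal sketch on an invented frame in which the first three rows are identical:

```python
import pandas as pd

# rows 0, 1 and 2 are identical, row 3 is unique (values are invented)
toy = pd.DataFrame({
    "lor": [1, 1, 1, 2],
    "schadenshoehe": [100, 100, 100, 500],
})

mark_all = toy.duplicated(keep=False).tolist()
mark_first = toy.duplicated(keep="first").tolist()
mark_last = toy.duplicated(keep="last").tolist()

print(mark_all)    # [True, True, True, False]
print(mark_first)  # [False, True, True, False]
print(mark_last)   # [True, True, False, False]
```

`keep=False` is the right choice for inspection because it shows every copy; `keep='first'` (the default) marks exactly the rows that `drop_duplicates()` would remove.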


In [73]:
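# inspect duplicates side by side
# note: the dates are still strings here, so they sort lexicographically (by day first), not chronologically
duplicates.sort_values(by=['tatzeit_anfang_datum', 'schadenshoehe'])\
    .tail(6)

# the backslash means the statement continues on the next line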

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt,erfassungsgrund
29789,01.09.2020,31.08.2020,18,01.09.2020,0,1400940,220,Nein,Mountainbike,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
29866,01.09.2020,31.08.2020,18,01.09.2020,0,1400940,220,Nein,Mountainbike,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
16188,01.09.2021,31.08.2021,16,31.08.2021,17,2400623,3900,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
16189,01.09.2021,31.08.2021,16,31.08.2021,17,2400623,3900,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
12041,02.11.2020,31.10.2020,18,02.11.2020,8,10100312,299,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
12045,02.11.2020,31.10.2020,18,02.11.2020,8,10100312,299,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern


In [74]:
# total number of rows that occur more than once (all copies counted)
len(duplicates)

181

In [75]:
# the duplicated rows agree in every column (time, place, damage, bike type), which makes separate thefts implausible, so we drop them.
# drop duplicates and assign the result to a new dataframe called 'thefts_df_dedup'

thefts_df_dedup = thefts_df_raw.drop_duplicates().copy()


In [76]:
# Always double check your results
print('thefts_df_raw count: '+str(len(thefts_df_raw)))
print('thefts_df_dedup: '+ str(len(thefts_df_dedup)))
print('difference: '+ str(len(thefts_df_raw)-len(thefts_df_dedup)))


thefts_df_raw count: 39407
thefts_df_dedup: 39311
difference: 96


In [77]:
# does this match our duplicates?
# the difference of 96 means that 96 surplus copies were deleted

# number of rows that appear more than once (counting every copy, whether a row occurs twice, three times, ...)
print('nr of duplicates: '+ str(len(duplicates)))
#print(f'nr of duplicates: {len(duplicates)}')

# the distinct rows among them (one representative per group of identical rows)
print('nr of unique rows in duplicates: '+ str(len(duplicates.drop_duplicates())))
# the surplus copies, i.e. exactly the rows that drop_duplicates() removes
print('nr of duplicated rows in duplicates: '+ str(len(duplicates)-len(duplicates.drop_duplicates())))


nr of duplicates: 181
nr of unique rows in duplicates: 85
nr of duplicated rows in duplicates: 96
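
The three numbers are linked: 181 rows are involved in duplication, they collapse to 85 distinct rows, so 181 - 85 = 96 surplus copies are removed. The same bookkeeping can be sketched on an invented frame where one row occurs three times, one twice and one once:

```python
import pandas as pd

# value 1 occurs three times, value 2 twice, value 3 once (values are invented)
toy = pd.DataFrame({"lor": [1, 1, 1, 2, 2, 3]})

involved = toy[toy.duplicated(keep=False)]

n_involved = len(involved)                         # every copy of a repeated row: 3 + 2
n_distinct = len(involved.drop_duplicates())       # one representative per group
n_surplus = n_involved - n_distinct                # copies beyond the first
n_removed = len(toy) - len(toy.drop_duplicates())  # what drop_duplicates() deletes

print(n_involved, n_distinct, n_surplus, n_removed)  # 5 2 3 3
```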


In [78]:
thefts_df_dedup['schadenshoehe'].size

# same result as len()

39311

In [79]:
# thefts_df_raw.duplicated().value_counts()

Do the numbers make sense to you?

In [80]:
# in the worst case, if this is really confusing, you can export the duplicates and double-check them manually in Excel
# thefts_df_raw[thefts_df_raw.duplicated(keep=False)]\
#     .sort_values(by=['tatzeit_anfang_datum', 'schadenshoehe'])\
#         .to_csv('./check.csv')

If yes, let's continue.

In [81]:
# drop the columns 'angelegt_am' and 'erfassungsgrund': when and why an observation was added to the database is irrelevant to us.

# step-by-step alternative:
#thefts_new = thefts_df_dedup.drop('angelegt_am', axis='columns')
#thefts_newly = thefts_new.drop('erfassungsgrund', axis=1)
#thefts_newly.head()

# thefts_df_dedup is already a copy (see .copy() above), so inplace=True leaves thefts_df_raw untouched
thefts_df_dedup.drop(['angelegt_am', 'erfassungsgrund'], axis=1, inplace=True)
thefts_df_dedup.head()

Unnamed: 0,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt
0,10.09.2020,10,10.09.2020,12,3400723,706,Nein,Damenfahrrad,Fahrraddiebstahl
1,09.09.2020,16,10.09.2020,7,9200716,220,Nein,Damenfahrrad,Fahrraddiebstahl
2,10.09.2020,15,10.09.2020,18,6100207,550,Nein,Herrenfahrrad,Fahrraddiebstahl
3,10.09.2020,20,10.09.2020,21,1300733,548,Nein,Herrenfahrrad,Fahrraddiebstahl
4,09.09.2020,22,10.09.2020,11,8100207,700,Nein,Fahrrad,Fahrraddiebstahl


In [82]:
# which distinct values does the 'versuch' (attempt) column hold?
# look up 'unique()' and try to understand what it's doing

thefts_df_dedup.versuch.unique()

# it returns the distinct values in this column

array(['Nein', 'Ja', 'Unbekannt'], dtype=object)

In [83]:
# and how often does each category occur?
# look up 'value_counts()' and try to understand what it's doing

thefts_df_dedup.versuch.value_counts()

# it counts how often each distinct value occurs

versuch
Nein         39137
Ja             167
Unbekannt        7
Name: count, dtype: int64

In [84]:
#versuch_unbekannt = thefts_df_dedup[thefts_df_dedup['versuch'] == 'Unbekannt']
#versuch_unbekannt

In [85]:
#versuch_ja_unbekannt = thefts_df_dedup[(thefts_df_dedup['versuch'] == 'Ja') | (thefts_df_dedup['versuch'] == 'Unbekannt')]
#versuch_ja_unbekannt

In [86]:
# we have just 167 attempts and 7 thefts of unknown state in our dataset, so we decide to drop those observations.

#thefts_df_dedup.drop(thefts_df_dedup[(thefts_df_dedup['versuch'] == 'Ja') | (thefts_df_dedup['versuch'] == 'Unbekannt')].index)

In [87]:
thefts_df_dedup.set_index('versuch', inplace=True)
thefts_df_dedup.drop(['Ja', 'Unbekannt'], inplace=True)
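
Moving 'versuch' into the index works, but it leaves the dataframe with a non-unique index in which every label is 'Nein'. A possible alternative, not the approach this notebook uses, is to filter with a boolean mask and keep 'versuch' as a regular column. The sketch below runs on an invented frame:

```python
import pandas as pd

# toy frame with all three 'versuch' categories (values are invented)
toy = pd.DataFrame({
    "versuch": ["Nein", "Ja", "Unbekannt", "Nein"],
    "schadenshoehe": [706, 220, 550, 548],
})

# keep only completed thefts
completed = toy[toy["versuch"] == "Nein"].reset_index(drop=True)

# equivalent: exclude the unwanted categories explicitly
completed_alt = toy[~toy["versuch"].isin(["Ja", "Unbekannt"])].reset_index(drop=True)

print(completed)
```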

In [88]:
#thefts_df_dedup[(thefts_df_dedup['versuch'] == 'Ja') | (thefts_df_dedup['versuch'] == 'Unbekannt')]

In [89]:
thefts_df_dedup

Unnamed: 0_level_0,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,art_des_fahrrads,delikt
versuch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Nein,10.09.2020,10,10.09.2020,12,3400723,706,Damenfahrrad,Fahrraddiebstahl
Nein,09.09.2020,16,10.09.2020,7,9200716,220,Damenfahrrad,Fahrraddiebstahl
Nein,10.09.2020,15,10.09.2020,18,6100207,550,Herrenfahrrad,Fahrraddiebstahl
Nein,10.09.2020,20,10.09.2020,21,1300733,548,Herrenfahrrad,Fahrraddiebstahl
Nein,09.09.2020,22,10.09.2020,11,8100207,700,Fahrrad,Fahrraddiebstahl
...,...,...,...,...,...,...,...,...
Nein,06.08.2021,18,09.08.2021,8,1100309,600,Fahrrad,Fahrraddiebstahl
Nein,07.08.2021,13,09.08.2021,8,1200522,3300,Herrenfahrrad,Fahrraddiebstahl
Nein,07.08.2021,11,09.08.2021,9,6100102,499,Damenfahrrad,Fahrraddiebstahl
Nein,09.08.2021,13,09.08.2021,14,2200211,300,Damenfahrrad,Fahrraddiebstahl


In [90]:
#thefts_df_dedup.drop(1)

In [91]:
type(thefts_df_dedup['tatzeit_anfang_datum'])

pandas.core.series.Series

In [92]:
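# change date text string to datetime datatype
# the strings are day.month.year (e.g. 10.09.2020 is 10 September), so we state the format explicitly;
# format='mixed' would guess month-first and read that date as 9 October
# %Y is the four-digit year (20xx), %y would be the two-digit year (xx)
thefts_df_dedup['tatzeit_anfang_datum'] = pd.to_datetime(thefts_df_dedup['tatzeit_anfang_datum'], format='%d.%m.%Y')
thefts_df_dedup['tatzeit_ende_datum'] = pd.to_datetime(thefts_df_dedup['tatzeit_ende_datum'], format='%d.%m.%Y')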
thefts_df_dedup.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39137 entries, Nein to Nein
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   tatzeit_anfang_datum   39137 non-null  datetime64[ns]
 1   tatzeit_anfang_stunde  39137 non-null  int64         
 2   tatzeit_ende_datum     39137 non-null  datetime64[ns]
 3   tatzeit_ende_stunde    39137 non-null  int64         
 4   lor                    39137 non-null  int64         
 5   schadenshoehe          39137 non-null  int64         
 6   art_des_fahrrads       39137 non-null  object        
 7   delikt                 39137 non-null  object        
dtypes: datetime64[ns](2), int64(4), object(2)
memory usage: 2.7+ MB
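
The date strings in this dataset are written day.month.year, so the way they are parsed matters. A minimal sketch, using two date strings that appear in the tables above, compares explicit parsing with pandas' default month-first guess:

```python
import pandas as pd

raw = pd.Series(["10.09.2020", "09.08.2021"])

# explicit format: day.month.year
explicit = pd.to_datetime(raw, format="%d.%m.%Y")

# alternative: let pandas infer the format, but tell it the day comes first
dayfirst = pd.to_datetime(raw, dayfirst=True)

# without any hint an ambiguous string is read month-first
month_first = pd.to_datetime("10.09.2020")

print(explicit.tolist())
print(month_first)  # 2020-10-09 00:00:00, i.e. 9 October instead of 10 September
```

A quick plausibility check after parsing is to compare a few parsed values with their raw strings, especially dates whose day is 12 or lower, since only those can be swapped silently.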


In [93]:
# now that the dates are no longer plain strings, we can look at the timeframe
# sanity check against the raw strings: a swapped day and month would shift these bounds
thefts_df_dedup.tatzeit_anfang_datum.min(), thefts_df_dedup.tatzeit_ende_datum.max()

(Timestamp('2020-01-01 00:00:00'), Timestamp('2021-12-11 00:00:00'))

In [94]:
# ... or can even do calculations on the date fields
thefts_df_dedup.tatzeit_ende_datum.max() - thefts_df_dedup.tatzeit_anfang_datum.min()

Timedelta('710 days 00:00:00')

In [95]:
# confirm the new datatypes
thefts_df_dedup[['tatzeit_anfang_datum', 'tatzeit_ende_datum']].info()

<class 'pandas.core.frame.DataFrame'>
Index: 39137 entries, Nein to Nein
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   tatzeit_anfang_datum  39137 non-null  datetime64[ns]
 1   tatzeit_ende_datum    39137 non-null  datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 917.3+ KB


### Yay!  We're done with cleaning our dataset :-) 

Now we want to re-use this code later. Let's wrap all the final cleaning steps that we came up with into a function. The function should:
- be called 'clean_bike_data',
- take a dataframe df as input,
- return that dataframe with all the cleaning steps performed on it,
- contain comments that explain each cleaning step.

Test your function with your dataframe!

In [1]:
import pandas as pd
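
One way to test the function is to write down what a cleaned dataframe should look like and check those properties. The helper `check_cleaned` below is a hypothetical name and not part of the notebook. It assumes that 'versuch' ends up in the index, as in the cells above, and is demonstrated on two tiny hand-built frames with invented values:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_cleaned(df):
    """Return a list of problems found in a cleaned bike-theft dataframe (hypothetical helper)."""
    problems = []
    # the two bookkeeping columns should be gone
    for col in ("angelegt_am", "erfassungsgrund"):
        if col in df.columns:
            problems.append(f"column {col!r} was not dropped")
    # both date columns should hold real datetimes, not strings
    for col in ("tatzeit_anfang_datum", "tatzeit_ende_datum"):
        if not is_datetime64_any_dtype(df[col]):
            problems.append(f"column {col!r} is not datetime")
    # attempts and unknown cases should have been removed
    if df.index.isin(["Ja", "Unbekannt"]).any():
        problems.append("attempted or unknown thefts are still present")
    return problems


# a frame that looks like the expected result
good = pd.DataFrame(
    {
        "tatzeit_anfang_datum": pd.to_datetime(["2020-09-10"]),
        "tatzeit_ende_datum": pd.to_datetime(["2020-09-10"]),
    },
    index=pd.Index(["Nein"], name="versuch"),
)

# a frame where no cleaning step has been applied
bad = pd.DataFrame(
    {
        "tatzeit_anfang_datum": ["10.09.2020"],
        "tatzeit_ende_datum": ["10.09.2020"],
        "angelegt_am": ["14.09.2020"],
    },
    index=pd.Index(["Ja"], name="versuch"),
)

good_problems = check_cleaned(good)
bad_problems = check_cleaned(bad)

print(good_problems)
print(bad_problems)
```

Once `clean_bike_data` exists, the same helper could be applied to its return value; an empty list would mean that none of these checks failed.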

In [2]:
def clean_bike_data(df):
    # drop duplicates (copy so the caller's dataframe is not modified)
    df = df.drop_duplicates().copy()
    # drop the columns 'angelegt_am' and 'erfassungsgrund': when and why an observation was added to the database is irrelevant to us.
    df.drop(['angelegt_am', 'erfassungsgrund'], axis=1, inplace=True)
    #df = df.drop(columns=['angelegt_am', 'erfassungsgrund']) # alternative to the line above
    # we have just 167 attempts and 7 thefts of unknown state in our dataset, so we decide to drop those observations.
    # note: this moves 'versuch' into the index, where only 'Nein' remains
    df.set_index('versuch', inplace=True)
    df.drop(['Ja', 'Unbekannt'], inplace=True)
    # change date text strings (day.month.year) to datetime datatype; use the explicit format for both columns,
    # because format='mixed' guesses month-first and swaps day and month
    df['tatzeit_anfang_datum'] = pd.to_datetime(df['tatzeit_anfang_datum'], format='%d.%m.%Y')
    df['tatzeit_ende_datum'] = pd.to_datetime(df['tatzeit_ende_datum'], format='%d.%m.%Y')

    return df

In [3]:
# test your function

# read in the raw data again
# proper encoding is necessary here!
thefts_df_test = pd.read_csv('../../data/Fahrraddiebstahl.csv', encoding='latin-1')
# make column names lowercase
thefts_df_test.columns = thefts_df_test.columns.str.lower()
thefts_df_test.head(2)

clean_df = clean_bike_data(thefts_df_test)

In [4]:
clean_df

Unnamed: 0_level_0,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,art_des_fahrrads,delikt
versuch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Nein,2020-09-10,10,2020-09-10,12,3400723,706,Damenfahrrad,Fahrraddiebstahl
Nein,2020-09-09,16,2020-09-10,7,9200716,220,Damenfahrrad,Fahrraddiebstahl
Nein,2020-09-10,15,2020-09-10,18,6100207,550,Herrenfahrrad,Fahrraddiebstahl
Nein,2020-09-10,20,2020-09-10,21,1300733,548,Herrenfahrrad,Fahrraddiebstahl
Nein,2020-09-09,22,2020-09-10,11,8100207,700,Fahrrad,Fahrraddiebstahl
...,...,...,...,...,...,...,...,...
Nein,2021-08-06,18,2021-08-09,8,1100309,600,Fahrrad,Fahrraddiebstahl
Nein,2021-08-07,13,2021-08-09,8,1200522,3300,Herrenfahrrad,Fahrraddiebstahl
Nein,2021-08-07,11,2021-08-09,9,6100102,499,Damenfahrrad,Fahrraddiebstahl
Nein,2021-08-09,13,2021-08-09,14,2200211,300,Damenfahrrad,Fahrraddiebstahl
