### Cleaning the dataset on stolen bikes


In [1]:
# standard import of pandas
import pandas as pd

## Loading the first dataset
The data we'll use covers bicycle thefts at the granular level of Berlin's city planning areas, the so-called "LOR" ("Lebensweltlich orientierte Räume"). We will come across the LORs again later!  
This data is collected by the Berlin police and provided by Berlin Open Data.  

### The goal for later: To be able to identify areas in Berlin with the most bike thefts or the highest theft amounts  
### The goal for today: clean this dataset to prepare it for our data analysis

First things first: we load the .csv file into a dataframe and get an overview.

[Website of the dataset - daten.berlin.de](https://daten.berlin.de/datensaetze/fahrraddiebstahl-berlin)

- Licence:
    - Creative Commons Namensnennung CC-BY License
- Geographical Granularity: 
    - Berlin
- Publisher: 
    - Polizei Berlin LKA St 14
- E-mail: 
    - onlineredaktion@polizei.berlin.de

In [2]:
import pandas as pd

# proper encoding is necessary here!
thefts_df_raw = pd.read_csv('../../data/Fahrraddiebstahl.csv', encoding='latin-1') 
# make column names lowercase
thefts_df_raw.columns = thefts_df_raw.columns.str.lower() 
thefts_df_raw.head(2)

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt,erfassungsgrund
0,14.09.2020,10.09.2020,10,10.09.2020,12,3400723,706,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
1,29.09.2020,09.09.2020,16,10.09.2020,7,9200716,220,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
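Why the encoding matters can be seen on a tiny in-memory example. This is only a sketch: the two-column CSV text below is made up for illustration (it is not the real file), and it assumes the export is Latin-1 encoded, as the `read_csv` call above implies.

```python
import io
import pandas as pd

# Hypothetical mini CSV, encoded as Latin-1 (assumption: same encoding as the real export).
raw_bytes = (
    "DELIKT,ERFASSUNGSGRUND\n"
    "Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern\n"
).encode("latin-1")

# Reading the bytes as UTF-8 fails, because the Latin-1 byte for 'ä' is not valid UTF-8 here.
try:
    pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# With the matching encoding the umlaut is read correctly.
demo_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
demo_df.columns = demo_df.columns.str.lower()

print("UTF-8 failed:", utf8_failed)
print(demo_df.loc[0, "erfassungsgrund"])
```

If a file loads without an error but shows garbled umlauts, the encoding is the first thing worth checking.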


In [3]:
# what's the shape, the observations, datatypes and null-counts?
thefts_df_raw.shape

(39407, 11)

In [4]:
thefts_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39407 entries, 0 to 39406
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   angelegt_am            39407 non-null  object
 1   tatzeit_anfang_datum   39407 non-null  object
 2   tatzeit_anfang_stunde  39407 non-null  int64 
 3   tatzeit_ende_datum     39407 non-null  object
 4   tatzeit_ende_stunde    39407 non-null  int64 
 5   lor                    39407 non-null  int64 
 6   schadenshoehe          39407 non-null  int64 
 7   versuch                39407 non-null  object
 8   art_des_fahrrads       39407 non-null  object
 9   delikt                 39407 non-null  object
 10  erfassungsgrund        39407 non-null  object
dtypes: int64(4), object(7)
memory usage: 3.3+ MB


Let's think about cleaning our data:

- drop duplicates? inspect!
- drop column 'angelegt_am' and 'erfassungsgrund' - irrelevant to us
- column 'versuch': inspect!  
- column 'tatzeit_anfang_datum': change date string to datetime format  
- column 'tatzeit_ende_datum': change date string to datetime format

In [5]:
# inspect duplicates
duplicates = thefts_df_raw[thefts_df_raw.duplicated(keep=False)]

In [6]:
# inspect duplicates
duplicates.sort_values(by=['tatzeit_anfang_datum', 'schadenshoehe'])\
    .tail(6)

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt,erfassungsgrund
29789,01.09.2020,31.08.2020,18,01.09.2020,0,1400940,220,Nein,Mountainbike,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
29866,01.09.2020,31.08.2020,18,01.09.2020,0,1400940,220,Nein,Mountainbike,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
16188,01.09.2021,31.08.2021,16,31.08.2021,17,2400623,3900,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
16189,01.09.2021,31.08.2021,16,31.08.2021,17,2400623,3900,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
12041,02.11.2020,31.10.2020,18,02.11.2020,8,10100312,299,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
12045,02.11.2020,31.10.2020,18,02.11.2020,8,10100312,299,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern


In [7]:
# total nr of duplicates
len(duplicates)

181

In [8]:
# the duplicated rows agree in every single column, which is implausible for independent thefts, so we drop them.
# drop duplicates and assign the result to a new dataframe called 'thefts_df_dedup'

thefts_df_dedup = thefts_df_raw.drop_duplicates()


In [9]:
# Always double check your results
print('thefts_df_raw count: '+str(len(thefts_df_raw)))
print('thefts_df_dedup: '+ str(len(thefts_df_dedup)))
print('difference: '+ str(len(thefts_df_raw)-len(thefts_df_dedup)))


thefts_df_raw count: 39407
thefts_df_dedup: 39311
difference: 96


In [10]:
# does this match with our duplicates?
# drop_duplicates() keeps the first row of each group of identical rows, so the 96 are the redundant copies that were deleted
print('nr of duplicates: '+ str(len(duplicates)))
print('nr of unique rows in duplicates: '+ str(len(duplicates.drop_duplicates())))
print('nr of duplicated rows in duplicates: '+ str(len(duplicates)-len(duplicates.drop_duplicates())))


nr of duplicates: 181
nr of unique rows in duplicates: 85
nr of duplicated rows in duplicates: 96


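Do the numbers make sense to you? 181 rows belong to a group of identical rows, these rows form 85 distinct groups, and keeping one row per group removes 181 - 85 = 96 rows.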
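The relationship between these three counts can be reproduced on a toy table. A minimal sketch, assuming a hypothetical six-row dataframe `toy` (its values are invented and only the column names are borrowed from the dataset):

```python
import pandas as pd

# Hypothetical toy data: one row appears three times, one appears twice, one is unique.
toy = pd.DataFrame({
    "lor": [1, 1, 1, 2, 2, 3],
    "schadenshoehe": [100, 100, 100, 250, 250, 400],
})

# keep=False marks every member of a duplicate group, not just the later copies.
involved = toy[toy.duplicated(keep=False)]

# One representative per duplicate group.
unique_groups = involved.drop_duplicates()

# Rows that drop_duplicates() actually removes from the full table.
removed = len(toy) - len(toy.drop_duplicates())

print(len(involved), len(unique_groups), removed)
```

The number of removed rows always equals the rows involved in duplication minus the number of distinct groups, which is the same arithmetic as in the cells above.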

In [11]:
# in the worst case, if this is really confusing, you can export the duplicates and double check them manually in Excel
# thefts_df_raw[thefts_df_raw.duplicated(keep=False)]\
#     .sort_values(by=['tatzeit_anfang_datum', 'schadenshoehe'])\
#         .to_csv('./check.csv')

If yes, let's continue.

In [12]:
# drop columns 'angelegt_am' and 'erfassungsgrund': when and why an observation got added to the database is irrelevant to us.
# drop() returns a new dataframe, so this only previews the result and leaves thefts_df_raw and thefts_df_dedup untouched
thefts_df_raw.drop(columns=['angelegt_am', 'erfassungsgrund'])

Unnamed: 0,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt
0,10.09.2020,10,10.09.2020,12,3400723,706,Nein,Damenfahrrad,Fahrraddiebstahl
1,09.09.2020,16,10.09.2020,7,9200716,220,Nein,Damenfahrrad,Fahrraddiebstahl
2,10.09.2020,15,10.09.2020,18,6100207,550,Nein,Herrenfahrrad,Fahrraddiebstahl
3,10.09.2020,20,10.09.2020,21,1300733,548,Nein,Herrenfahrrad,Fahrraddiebstahl
4,09.09.2020,22,10.09.2020,11,8100207,700,Nein,Fahrrad,Fahrraddiebstahl
...,...,...,...,...,...,...,...,...,...
39402,06.08.2021,18,09.08.2021,8,1100309,600,Nein,Fahrrad,Fahrraddiebstahl
39403,07.08.2021,13,09.08.2021,8,1200522,3300,Nein,Herrenfahrrad,Fahrraddiebstahl
39404,07.08.2021,11,09.08.2021,9,6100102,499,Nein,Damenfahrrad,Fahrraddiebstahl
39405,09.08.2021,13,09.08.2021,14,2200211,300,Nein,Damenfahrrad,Fahrraddiebstahl


In [13]:
# how many unique values does the 'versuch' (attempt) column hold?
# look up 'unique()' and try to understand what it's doing

print(thefts_df_dedup.versuch.unique())

num_unique_versuch = len(thefts_df_dedup.versuch.unique())
print("Number of unique values in the 'versuch' column:", num_unique_versuch)

['Nein' 'Ja' 'Unbekannt']
Number of unique values in the 'versuch' column: 3


In [14]:
# and what is the count of those categories?
# look up 'value_counts()' and try to understand what it's doing

thefts_df_dedup.versuch.value_counts()

Nein         39137
Ja             167
Unbekannt        7
Name: versuch, dtype: int64
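For readers looking up `unique()` and `value_counts()`, here is a small sketch on a hypothetical five-element series (the values are invented, only the category names come from the data). It also shows `nunique()` and the `normalize=True` option, which the cells above do not use.

```python
import pandas as pd

# Hypothetical mini version of the 'versuch' column.
versuch = pd.Series(["Nein", "Nein", "Ja", "Nein", "Unbekannt"], name="versuch")

# unique() returns the distinct values in order of first appearance.
distinct_values = versuch.unique()

# nunique() is a shortcut for counting the distinct values.
n_distinct = versuch.nunique()

# value_counts() counts the rows per category, most frequent first.
counts = versuch.value_counts()

# normalize=True returns shares instead of absolute counts.
shares = versuch.value_counts(normalize=True)

print(distinct_values)
print(n_distinct)
print(counts)
print(shares)
```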

In [15]:
# only 167 attempted thefts and 7 thefts of unknown state are in our dataset, so we decide to drop those observations.
# .copy() gives us an independent dataframe that we can safely modify in the next steps
thefts_df_dedup = thefts_df_dedup[~thefts_df_dedup['versuch'].isin(['Ja', 'Unbekannt'])].copy()
thefts_df_dedup

# Option 2:
# thefts_df_dedup_cleaned = thefts_df_dedup.drop(thefts_df_dedup[(thefts_df_dedup['versuch'] == 'Ja') | (thefts_df_dedup['versuch'] == 'Unbekannt')].index)

Unnamed: 0,angelegt_am,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt,erfassungsgrund
0,14.09.2020,10.09.2020,10,10.09.2020,12,3400723,706,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
1,29.09.2020,09.09.2020,16,10.09.2020,7,9200716,220,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
2,10.09.2020,10.09.2020,15,10.09.2020,18,6100207,550,Nein,Herrenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
3,10.09.2020,10.09.2020,20,10.09.2020,21,1300733,548,Nein,Herrenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
4,23.09.2020,09.09.2020,22,10.09.2020,11,8100207,700,Nein,Fahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
...,...,...,...,...,...,...,...,...,...,...,...
39402,09.08.2021,06.08.2021,18,09.08.2021,8,1100309,600,Nein,Fahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
39403,09.08.2021,07.08.2021,13,09.08.2021,8,1200522,3300,Nein,Herrenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
39404,09.08.2021,07.08.2021,11,09.08.2021,9,6100102,499,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
39405,09.08.2021,09.08.2021,13,09.08.2021,14,2200211,300,Nein,Damenfahrrad,Fahrraddiebstahl,Sonstiger schwerer Diebstahl von Fahrrädern
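The row filter can be written in more than one way. The sketch below uses a hypothetical four-row dataframe with invented values and compares two `!=` comparisons combined with `&` against a single `isin()` call; the `.copy()` at the end is an optional habit, not something the dataset requires.

```python
import pandas as pd

# Hypothetical toy data with the three categories seen in the 'versuch' column.
toy = pd.DataFrame({
    "versuch": ["Nein", "Ja", "Nein", "Unbekannt"],
    "schadenshoehe": [706, 220, 550, 548],
})

# Variant 1: two comparisons combined with & (each side needs its own parentheses).
mask_compare = (toy["versuch"] != "Ja") & (toy["versuch"] != "Unbekannt")

# Variant 2: one isin() call, negated with ~.
mask_isin = ~toy["versuch"].isin(["Ja", "Unbekannt"])

# Both masks select the same rows.
print(mask_compare.equals(mask_isin))

# .copy() detaches the result from 'toy', so later column assignments
# do not raise a SettingWithCopyWarning in pandas versions that emit it.
completed = toy[mask_isin].copy()
completed["schadenshoehe"] = completed["schadenshoehe"].astype(float)
print(completed)
```

Note that the filtered dataframe keeps the original index labels, which is why the index in the output above still runs up to 39405 although fewer rows remain.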


In [16]:
# change the date text strings to the datetime datatype
# the format string tells pandas that the dates are written as day.month.year
thefts_df_dedup['tatzeit_anfang_datum'] = pd.to_datetime(thefts_df_dedup['tatzeit_anfang_datum'], format='%d.%m.%Y')
thefts_df_dedup['tatzeit_ende_datum'] = pd.to_datetime(thefts_df_dedup['tatzeit_ende_datum'], format='%d.%m.%Y')

In [17]:
# now that the dates are no longer plain strings, we can have a look at the timeframe
thefts_df_dedup.tatzeit_anfang_datum.min(), thefts_df_dedup.tatzeit_ende_datum.max()

(Timestamp('2020-01-01 00:00:00'), Timestamp('2021-11-28 00:00:00'))

In [18]:
# ... or can even do calculations on the date fields
thefts_df_dedup.tatzeit_ende_datum.max() - thefts_df_dedup.tatzeit_anfang_datum.min()

Timedelta('697 days 00:00:00')
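How the format string drives the parsing can be checked on two hand-written dates. A minimal sketch with invented example values; the second date is deliberately ambiguous (3 April vs. 4 March) to show that `%d.%m.%Y` reads the day first.

```python
import pandas as pd

# Hypothetical date strings in the German day.month.year notation.
dates = pd.Series(["10.09.2020", "03.04.2021"])

# An explicit format removes any guessing about day and month order.
parsed = pd.to_datetime(dates, format="%d.%m.%Y")

# The second value is read as 3 April 2021, not 4 March 2021.
print(parsed[1].day, parsed[1].month)

# Subtracting two timestamps yields a Timedelta.
span = parsed.max() - parsed.min()
print(span)

# The .dt accessor exposes date parts for a whole column at once.
print(parsed.dt.year.tolist())
```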

In [19]:
# confirm the new datatypes
thefts_df_dedup[['tatzeit_anfang_datum', 'tatzeit_ende_datum']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39137 entries, 0 to 39406
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   tatzeit_anfang_datum  39137 non-null  datetime64[ns]
 1   tatzeit_ende_datum    39137 non-null  datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 917.3 KB


### Yay!  We're done with cleaning our dataset :-) 

Now, we want to re-use this code later. Let's wrap all the final cleaning steps that we came up with into a function. The function should:
- be called 'clean_bike_data',
- take a dataframe df as its input variable,
- return the dataframe with all the cleaning steps performed on it,
- contain comments that explain each cleaning step.

Test your function with your dataframe!

In [20]:
def clean_bike_data(df):
    # drop exact duplicate rows
    df = df.drop_duplicates()

    # drop columns 'angelegt_am' and 'erfassungsgrund': when and why an observation got added to the database is irrelevant to us.
    # assigning the result (instead of using inplace=True) avoids a SettingWithCopyWarning
    df = df.drop(columns=['angelegt_am', 'erfassungsgrund'])

    # we have just 167 attempts and 7 thefts of unknown state in our dataset, so we decide to drop those observations.
    # .copy() makes the filtered result an independent dataframe that is safe to modify below
    df = df[~df['versuch'].isin(['Ja', 'Unbekannt'])].copy()

    # change date text strings to the datetime datatype
    df['tatzeit_anfang_datum'] = pd.to_datetime(df['tatzeit_anfang_datum'], format='%d.%m.%Y')
    df['tatzeit_ende_datum'] = pd.to_datetime(df['tatzeit_ende_datum'], format='%d.%m.%Y')

    return df

In [21]:
# test your function

# read in the raw data again
# proper encoding is necessary here!
thefts_df_test = pd.read_csv('../../data/Fahrraddiebstahl.csv', encoding='latin-1')
# make column names lowercase
thefts_df_test.columns = thefts_df_test.columns.str.lower()

clean_df = clean_bike_data(thefts_df_test)


In [22]:
clean_df

Unnamed: 0,tatzeit_anfang_datum,tatzeit_anfang_stunde,tatzeit_ende_datum,tatzeit_ende_stunde,lor,schadenshoehe,versuch,art_des_fahrrads,delikt
0,2020-09-10,10,2020-09-10,12,3400723,706,Nein,Damenfahrrad,Fahrraddiebstahl
1,2020-09-09,16,2020-09-10,7,9200716,220,Nein,Damenfahrrad,Fahrraddiebstahl
2,2020-09-10,15,2020-09-10,18,6100207,550,Nein,Herrenfahrrad,Fahrraddiebstahl
3,2020-09-10,20,2020-09-10,21,1300733,548,Nein,Herrenfahrrad,Fahrraddiebstahl
4,2020-09-09,22,2020-09-10,11,8100207,700,Nein,Fahrrad,Fahrraddiebstahl
...,...,...,...,...,...,...,...,...,...
39402,2021-08-06,18,2021-08-09,8,1100309,600,Nein,Fahrrad,Fahrraddiebstahl
39403,2021-08-07,13,2021-08-09,8,1200522,3300,Nein,Herrenfahrrad,Fahrraddiebstahl
39404,2021-08-07,11,2021-08-09,9,6100102,499,Nein,Damenfahrrad,Fahrraddiebstahl
39405,2021-08-09,13,2021-08-09,14,2200211,300,Nein,Damenfahrrad,Fahrraddiebstahl
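Looking at the table is one way to test the function; a few automated checks are another. The sketch below is an assumption-laden illustration: `check_cleaned` is a hypothetical helper that is not part of the notebook, and the two small dataframes are hand-made stand-ins for a cleaned and an uncleaned result.

```python
import pandas as pd

def check_cleaned(df):
    """Hypothetical helper: return a list of problems found in a cleaned dataframe."""
    problems = []
    # no exact duplicate rows should be left
    if df.duplicated().any():
        problems.append("duplicate rows found")
    # the two irrelevant columns should be gone
    for col in ("angelegt_am", "erfassungsgrund"):
        if col in df.columns:
            problems.append(f"column '{col}' was not dropped")
    # attempted and unknown thefts should be removed
    if df["versuch"].isin(["Ja", "Unbekannt"]).any():
        problems.append("'versuch' still contains 'Ja' or 'Unbekannt'")
    # both date columns should have a datetime dtype
    for col in ("tatzeit_anfang_datum", "tatzeit_ende_datum"):
        if not pd.api.types.is_datetime64_any_dtype(df[col]):
            problems.append(f"column '{col}' is not datetime")
    return problems

# Hand-made example that satisfies all checks.
good = pd.DataFrame({
    "tatzeit_anfang_datum": pd.to_datetime(["10.09.2020", "09.09.2020"], format="%d.%m.%Y"),
    "tatzeit_ende_datum": pd.to_datetime(["10.09.2020", "10.09.2020"], format="%d.%m.%Y"),
    "versuch": ["Nein", "Nein"],
    "schadenshoehe": [706, 220],
})

# Hand-made example with a leftover column and an attempted theft.
bad = good.assign(versuch=["Nein", "Ja"], angelegt_am=["14.09.2020", "29.09.2020"])

print(check_cleaned(good))
print(check_cleaned(bad))
```

On the real data, a call such as `check_cleaned(clean_df)` should return an empty list if the cleaning function did its job.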
