### Clean dataset on stolen bikes.


In [1]:
# standard import of pandas
import pandas as pd

## Loading the first dataset
We'll use data on bicycle thefts at the granular level of Berlin's city planning areas, the so-called "LOR" ("Lebensweltlich orientierte Räume"). We will come across them again later!  
This data is provided by Berlin Open Data and collected by the police of Berlin.  

### The goal for later: To be able to identify areas in Berlin with the most bike thefts or the highest theft amounts  
### The goal for today: clean this dataset to prepare it for our data analysis

First things first: we load the .csv file into a dataframe and get an overview.

[Website of the dataset - daten.berlin.de](https://daten.berlin.de/datensaetze/fahrraddiebstahl-berlin)

- Licence:
    - Creative Commons Namensnennung CC-BY License
- Geographical Granularity: 
    - Berlin
- Publisher: 
    - Polizei Berlin LKA St 14
- E-mail: 
    - onlineredaktion@polizei.berlin.de

In [None]:
# proper encoding is necessary here!
thefts_df_raw = pd.read_csv('../../data/Fahrraddiebstahl.csv', encoding='latin-1') 
#thefts_df_raw
 # make column names lowercase
thefts_df_raw.columns = thefts_df_raw.columns.str.lower() 
thefts_df_raw.head(2)

In [None]:
# what's the shape, the observations, datatypes and null-counts?
thefts_df_raw.shape

In [None]:
thefts_df_raw.info()

In [None]:
thefts_df_raw.describe() # without arguments, describe() summarizes only the numeric columns

In [None]:
thefts_df_raw.describe(include='O') # include='O' (capital letter O) summarizes only the object (string) columns

In [None]:
thefts_df_raw.describe(include='all')

In [None]:
thefts_df_raw.dtypes

In [None]:
thefts_df_raw.isnull().sum()

In [None]:
thefts_df_raw.isna().sum()
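
The two cells above return the same counts, because `isnull()` is an alias for `isna()` in pandas. A minimal sketch on a made-up toy frame (the column names `a` and `b` are purely illustrative):

```python
import numpy as np
import pandas as pd

# toy frame (made up) with one missing value per column
toy = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})

# both methods produce the same boolean mask
same = toy.isnull().equals(toy.isna())

# missing values per column
nulls = toy.isna().sum()
print(same)
print(nulls)
```

Either method is fine; picking one and using it consistently keeps the notebook easier to read.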

Let's think about cleaning our data:

- drop duplicates? inspect!
- drop column 'angelegt_am' and 'erfassungsgrund' - irrelevant to us
- column 'versuch': inspect!  
- column 'tatzeit_anfang_datum': change date string to datetime format  
- column 'tatzeit_ende_datum': change date string to datetime format

In [11]:
# inspect duplicates
duplicates = thefts_df_raw[thefts_df_raw.duplicated(keep=False)]
# keep=False   => every copy of a duplicated row is marked True
# keep='first' => the first copy is marked False, all later copies are True
# keep='last'  => the last copy is marked False, all earlier copies are True
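
To see the three `keep` options side by side, here is a small sketch on a made-up toy frame (not the bike data) in which one row appears three times:

```python
import pandas as pd

# toy data (made up): the row (1, 'a') appears three times
toy = pd.DataFrame({'id': [1, 1, 1, 2], 'val': ['a', 'a', 'a', 'b']})

mask_all = toy.duplicated(keep=False)      # every copy is flagged
mask_first = toy.duplicated(keep='first')  # the first copy is not flagged
mask_last = toy.duplicated(keep='last')    # the last copy is not flagged

print(mask_all.tolist())    # [True, True, True, False]
print(mask_first.tolist())  # [False, True, True, False]
print(mask_last.tolist())   # [True, True, False, False]
```

`keep=False` is the convenient choice for inspection, because it shows all copies of a row together.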

In [None]:
duplicates

In [None]:
# inspect duplicates
duplicates.sort_values(by=['tatzeit_anfang_datum', 'schadenshoehe'])\
    .tail(6)

# a backslash at the end of a line means the code continues on the next line

In [None]:
# total nr of duplicates
len(duplicates)

In [15]:
# the specifications of the duplicates indicate that they are implausible, so we drop them.
# drop duplicates and assign result to a new dataframe called 'thefts_df_dedup'

thefts_df_dedup = thefts_df_raw.drop_duplicates().copy()


In [None]:
# Always double check your results
print('thefts_df_raw count: '+str(len(thefts_df_raw)))
print('thefts_df_dedup: '+ str(len(thefts_df_dedup)))
print('difference: '+ str(len(thefts_df_raw)-len(thefts_df_dedup)))


In [None]:
# does this match with our duplicates?
# the 96 means there were 96 duplicated rows deleted

# number of rows that appear more than one time (not just twice, also three or four ... times)
print('nr of duplicates: '+ str(len(duplicates)))
#print(f'nr of duplicates: {len(duplicates)}')

# the remaining (unique) rows
print('nr of unique rows in duplicates: '+ str(len(duplicates.drop_duplicates())))
# this means the deleted rows, so these are the rows that appear multiple times (not just twice)
print('nr of duplicated rows in duplicates: '+ str(len(duplicates)-len(duplicates.drop_duplicates())))
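
The relation between these three numbers can be checked on a small example. The sketch below uses a made-up toy frame; the variable names `n_flagged`, `n_groups` and `n_removed` are only illustrative:

```python
import pandas as pd

# toy data (made up): (1, 'a') appears three times, (2, 'b') twice, (3, 'c') once
toy = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3], 'val': list('aaabbc')})

dups = toy[toy.duplicated(keep=False)]

n_flagged = len(dups)                              # all copies of duplicated rows
n_groups = len(dups.drop_duplicates())             # distinct rows among them
n_removed = len(toy) - len(toy.drop_duplicates())  # rows that drop_duplicates() deletes

print(n_flagged, n_groups, n_removed)  # 5 2 3
```

The number of deleted rows equals the flagged rows minus the distinct rows among them, since one copy of each duplicated row is kept.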


In [None]:
thefts_df_dedup['schadenshoehe'].size

# same result as len()

In [19]:
# data.duplicated().value_counts()

Do the numbers make sense to you? 

In [20]:
# in worst case, if this is really confusing, you can download and double check manually in Excel
# thefts_df_raw[thefts_df_raw.duplicated(keep=False)]\
#     .sort_values(by=['tatzeit_anfang_datum', 'schadenshoehe'])\
#         .to_csv('./check.csv')

... if yes, let's continue.

In [None]:
# drop column 'angelegt_am' and 'erfassungsgrund' - irrelevant to us, when and why observation got added to the database.

#thefts_new = thefts_df_dedup.drop('angelegt_am', axis='columns')
#thefts_newly = thefts_new.drop('erfassungsgrund', axis=1)
#thefts_newly.head()

thefts_df_dedup.drop(['angelegt_am', 'erfassungsgrund'], axis=1, inplace=True) # copy before inplace=True to keep the original df
thefts_df_dedup.head()

In [None]:
# how many unique values does the 'versuch' (attempt) column hold?
# look up 'unique()' and try to understand what it's doing

thefts_df_dedup.versuch.unique()

# it shows the different values in this column

In [None]:
# and what is the count of those categories?
# look up 'value_counts()' and try to understand what it's doing

thefts_df_dedup.versuch.value_counts()

# it counts how often each value occurs in the column

In [24]:
#versuch_unbekannt = thefts_df_dedup[thefts_df_dedup['versuch'] == 'Unbekannt']
#versuch_unbekannt

In [25]:
#versuch_ja_unbekannt = thefts_df_dedup[(thefts_df_dedup['versuch'] == 'Ja') | (thefts_df_dedup['versuch'] == 'Unbekannt')]
#versuch_ja_unbekannt

In [26]:
# we have just 167 attempts and 7 thefts of unknown state in our dataset, so we decide to drop those observations.

#thefts_df_dedup.drop(thefts_df_dedup[(thefts_df_dedup['versuch'] == 'Ja') | (thefts_df_dedup['versuch'] == 'Unbekannt')].index)

In [27]:
thefts_df_dedup.set_index('versuch', inplace=True)
thefts_df_dedup.drop(['Ja', 'Unbekannt'], inplace=True)
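
Dropping by index label works, but it turns 'versuch' into the index. As an alternative sketch, a boolean mask with `isin()` removes the same observations and keeps 'versuch' as a regular column. The toy frame below is made up, and the value 'Nein' for completed thefts is an assumption:

```python
import pandas as pd

# toy data (made up); 'Nein' as the value for completed thefts is an assumption
toy = pd.DataFrame({
    'versuch': ['Nein', 'Ja', 'Unbekannt', 'Nein'],
    'schadenshoehe': [500, 0, 120, 900],
})

# keep only rows whose 'versuch' value is neither 'Ja' nor 'Unbekannt'
completed = toy[~toy['versuch'].isin(['Ja', 'Unbekannt'])].copy()
print(completed)
```

Unlike the `set_index` variant, this version can be re-run without raising a `KeyError`, because it does not depend on the labels still being present.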

In [28]:
#thefts_df_dedup[(thefts_df_dedup['versuch'] == 'Ja') | (thefts_df_dedup['versuch'] == 'Unbekannt')]

In [None]:
thefts_df_dedup

In [30]:
#thefts_df_dedup.drop(1)

In [None]:
# type() would only report a pandas Series; .dtype shows the data type of the values
thefts_df_dedup['tatzeit_anfang_datum'].dtype

In [None]:
# change date text string to datetime datatype
# fill in the gap....
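# the dates are written day-first (dd.mm.yyyy), so we pass dayfirst=True; otherwise '03.01.2022' could be read as March 1st
# an explicit format such as '%d.%m.%Y' also works if every value follows it (%Y = four-digit year 20xx, %y = two-digit year xx)
thefts_df_dedup['tatzeit_anfang_datum'] = pd.to_datetime(thefts_df_dedup['tatzeit_anfang_datum'], format='mixed', dayfirst=True)
thefts_df_dedup['tatzeit_ende_datum'] = pd.to_datetime(thefts_df_dedup['tatzeit_ende_datum'], format='mixed', dayfirst=True)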
thefts_df_dedup.info()
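
A short sketch of how an explicit format string handles day-first dates, using made-up example strings. It assumes the dd.mm.yyyy layout and also shows `errors='coerce'`, which turns unparseable entries into `NaT` instead of raising an error:

```python
import pandas as pd

# made-up date strings in the assumed dd.mm.yyyy layout
raw_dates = pd.Series(['03.01.2022', '25.12.2021'])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(raw_dates, format='%d.%m.%Y')
print(parsed.dt.month.tolist())  # [1, 12]

# errors='coerce' replaces values that do not match the format with NaT
messy = pd.Series(['03.01.2022', 'n/a'])
coerced = pd.to_datetime(messy, format='%d.%m.%Y', errors='coerce')
print(coerced.isna().tolist())  # [False, True]
```

With an explicit format, a value in a different layout fails loudly (or becomes `NaT` with `errors='coerce'`) rather than being silently misread.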

In [None]:
# now that the dates are not only strings anymore, we can have a look at the timeframe
thefts_df_dedup.tatzeit_anfang_datum.min(), thefts_df_dedup.tatzeit_ende_datum.max()

In [None]:
# ... or can even do calculations on the date fields
thefts_df_dedup.tatzeit_ende_datum.max() - thefts_df_dedup.tatzeit_anfang_datum.min()
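
The same subtraction also works row by row. The sketch below computes the length of each theft window in days on a made-up toy frame; the column name `dauer_tage` is hypothetical and not part of the dataset:

```python
import pandas as pd

# toy data (made up) with already-parsed datetime columns
toy = pd.DataFrame({
    'tatzeit_anfang_datum': pd.to_datetime(['2022-01-03', '2022-02-10']),
    'tatzeit_ende_datum': pd.to_datetime(['2022-01-04', '2022-02-10']),
})

# subtracting two datetime columns gives a timedelta column; .dt.days extracts whole days
toy['dauer_tage'] = (toy['tatzeit_ende_datum'] - toy['tatzeit_anfang_datum']).dt.days
print(toy['dauer_tage'].tolist())  # [1, 0]
```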

In [None]:
# confirm the new datatypes
thefts_df_dedup[['tatzeit_anfang_datum', 'tatzeit_ende_datum']].info()

### Yay!  We're done with cleaning our dataset :-) 

Now, we want to re-use this code later. Let's wrap all the final cleaning steps that we came up with into a function. The function should:
- be called 'clean_bike_data',
- have a dataframe df as input variable,
- return the same dataframe df with all the cleaning steps performed on it.
- Add comments to explain each cleaning step.

Test your function with your dataframe!

In [36]:
import pandas as pd

In [37]:
def clean_bike_data(df):
    # drop duplicates
    df = df.drop_duplicates().copy()
    # drop column 'angelegt_am' and 'erfassungsgrund' - irrelevant to us, when and why observation got added to the database.
    df.drop(['angelegt_am', 'erfassungsgrund'], axis=1, inplace=True)
    #df = df.drop(columns=['angelegt_am', 'erfassungsgrund']) # alternative to the line above
    # we have just 167 attempts and 7 thefts of unknown state in our dataset, so we decide to drop those observations.
    df.set_index('versuch', inplace=True)
    df.drop(['Ja', 'Unbekannt'], inplace=True)
    # change date text string to datetime datatype
    df['tatzeit_anfang_datum'] = pd.to_datetime(df['tatzeit_anfang_datum'], format='%d.%m.%Y')
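    # dayfirst=True makes sure day-first strings such as '03.01.2022' are not read as March 1st
    df['tatzeit_ende_datum'] = pd.to_datetime(df['tatzeit_ende_datum'], format='mixed', dayfirst=True)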
    
    return df

In [38]:
# test your function

# read in the raw data again
# proper encoding is necessary here!
thefts_df_test = pd.read_csv('../../data/Fahrraddiebstahl.csv', encoding='latin-1') 
 # make column names lowercase
thefts_df_test.columns = thefts_df_test.columns.str.lower() 
thefts_df_test.head(2)

clean_df = clean_bike_data(thefts_df_test)

In [None]:
clean_df
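
Beyond looking at the result, a few automated checks can confirm that each cleaning step took effect. The helper `check_cleaned` below is hypothetical, not part of the notebook. It assumes the cleaned frame has 'versuch' as its index (as produced by `clean_bike_data`) and is demonstrated on a made-up toy frame in which 'Nein' is an assumed value:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_cleaned(df):
    """Hypothetical helper: report whether each cleaning step left its mark on df."""
    date_cols = ['tatzeit_anfang_datum', 'tatzeit_ende_datum']
    return {
        # the two administrative columns should be gone
        'admin_columns_dropped': not {'angelegt_am', 'erfassungsgrund'} & set(df.columns),
        # no attempts or unknown cases should remain in the 'versuch' index
        'attempts_removed': not df.index.isin(['Ja', 'Unbekannt']).any(),
        # both date columns should have a datetime dtype
        'dates_are_datetime': all(is_datetime64_any_dtype(df[c]) for c in date_cols),
    }


# toy frame (made up) shaped like the output of clean_bike_data
toy_clean = pd.DataFrame(
    {
        'tatzeit_anfang_datum': pd.to_datetime(['2022-01-03', '2022-02-10']),
        'tatzeit_ende_datum': pd.to_datetime(['2022-01-04', '2022-02-10']),
        'schadenshoehe': [500, 900],
    },
    index=pd.Index(['Nein', 'Nein'], name='versuch'),
)

report = check_cleaned(toy_clean)
print(report)
```

Calling `check_cleaned(clean_df)` on the real result should then return `True` for every key if the function behaves as intended.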