# Combining Datasets & Pivot Tables with Pandas

## Learning Objectives

At the end of this notebook you should be able to
- combine DataFrames with Pandas
- describe the different joining methods (how to join DataFrames)
- create pivot tables with Pandas
- clone ("copy") conda environments

Pandas offers several functions for combining two sets of data: `pd.merge()`, `df.merge()`, `df.join()`, and `pd.concat()`. They overlap considerably in what they can do. Note the small syntax difference: `merge()` and `concat()` can be called via the pandas module, while `merge()` and `join()` can be called on a DataFrame instance.   
In some cases one of them lets you write less code or express a particular combination more easily. The main difference lies in what each does by default: `merge()` joins on common columns, `join()` joins on indices, and `concat()` simply appends along a given axis.

You can find more detail about the differences between all three of these in the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). We'll look at some examples below. 
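
As a quick illustration of these defaults, here is a minimal sketch on two toy frames. The frames `left` and `right` and their column names are made up for this example and are not part of the bike theft data.

```python
import pandas as pd

# Toy data (hypothetical, only for illustration)
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# merge(): matches rows on the common column "key" (inner join by default)
merged = pd.merge(left, right)

# join(): matches rows on the index (left join by default)
joined = left.set_index("key").join(right.set_index("key"))

# concat(): stacks the frames along an axis (rows by default), no key matching
stacked = pd.concat([left, right])

print(merged.shape, joined.shape, stacked.shape)
```

The three results differ in shape because each function keeps a different set of rows and columns by default.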

## But first: cloning your environment

Why's that?  
In this notebook we want to use the GeoPandas package, _an open source project to add support for geographic data to pandas objects_. In brief: we will have a dataframe with an additional geometry datatype.  

Since we usually don't need packages for geospatial data, we don't want to load them every time we activate our usual nf_base environment.  
The good thing is:  
In conda, it is easy to copy, or rather clone, an existing environment.  
We will make use of this: create a new environment called _nf_geo_ based on _nf_base_, add the GeoPandas package to it, and then use this new environment for this notebook.

To do so, let's clone your hopefully properly set up nf_base environment:

In [None]:
# clone the nf_base - this may take a few seconds up to two minutes ...
!conda create --clone nf_base --name nf_geo

Since GeoPandas is available via the conda-forge channel, you may have to enable this channel first: 

In [None]:
# add conda forge to your conda channels
!conda config --add channels conda-forge

Then install the package directly into the newly built environment:

In [None]:
# Read as: conda install from conda-forge (-c conda-forge) into the nf_geo environment (-n nf_geo), without asking for confirmation (-y), the package geopandas
!conda install -c conda-forge -n nf_geo -y geopandas

*(If this last step takes longer than a minute and conda keeps telling you it is "solving environment", please reach out to us.)*

---

Now that we have our new nf_geo environment, activate it for this Jupyter notebook (choose the kernel) and we're ready to import the modules we need:

In [None]:
# standard import of pandas
import pandas as pd

# additional import of the geopandas package
import geopandas as gpd

# numpy, "numerical python" - we'll cover this in the following notebooks.
import numpy as np

# hides warning messages
import warnings
warnings.filterwarnings("ignore")

## Loading the first dataset
The data we'll use covers bicycle thefts at the granular level of Berlin city planning areas, the so-called "LOR" ("Lebensweltlich orientierte Räume"). We will come across this term again later!  
This data is provided by Berlin Open Data and collected by the police of Berlin.  

### The goal: identify the areas in Berlin with the most bike thefts or the highest theft amounts  

But first things first: we make the data accessible by loading the .csv file into a dataframe and getting an overview.

[Website of the dataset - daten.berlin.de](https://daten.berlin.de/datensaetze/fahrraddiebstahl-berlin)

- Licence:
    - Creative Commons Namensnennung CC-BY License
- Geographical Granularity: 
    - Berlin
- Publisher: 
    - Polizei Berlin LKA St 14
- E-Mail: 
    - onlineredaktion@polizei.berlin.de

In [None]:
thefts_df = pd.read_csv('data/Fahrraddiebstahl.csv', encoding='latin-1') # proper encoding is necessary here!
thefts_df.columns = thefts_df.columns.str.lower()  # make column names lowercase
thefts_df.head(2)

In [None]:
# what's the shape, the observations, datatypes and null-counts?
thefts_df.info()

Let's quickly think about cleaning our data:

- drop duplicates? inspect!
- drop column 'angelegt_am' and 'erfassungsgrund' - irrelevant to us
- column 'versuch': inspect!  
- column 'tatzeit_anfang_datum': change date string to datetime format  
- column 'tatzeit_ende_datum': change date string to datetime format

In [None]:
# inspect duplicates
thefts_df[thefts_df.duplicated(keep=False)].sort_values(by=['tatzeit_anfang_datum', 'schadenshoehe']).tail(6)

In [None]:
# the duplicated rows match in every column, which is implausible for separate thefts, so we drop them.
# drop duplicates (rows by default)
thefts_df.drop_duplicates(inplace=True)

In [None]:
# drop columns 'angelegt_am' and 'erfassungsgrund' - when and why an observation got added to the database is irrelevant to us.
thefts_df.drop(columns=['angelegt_am', 'erfassungsgrund'], inplace=True)


In [None]:
# which unique values does the 'versuch' (attempt) column hold?
thefts_df.versuch.unique()

In [None]:
# and what is the count of those categories?
thefts_df.versuch.value_counts()

In [None]:
# we have just 167 attempts and 7 thefts of unknown state in our dataset, so we decide to drop those observations.
thefts_df = thefts_df[thefts_df.versuch != 'Ja']
thefts_df = thefts_df[thefts_df.versuch != 'Unbekannt']

In [None]:
# change date text string to datetime datatype
thefts_df['tatzeit_anfang_datum'] = pd.to_datetime(thefts_df['tatzeit_anfang_datum'])
thefts_df['tatzeit_ende_datum'] = pd.to_datetime(thefts_df['tatzeit_ende_datum'])

In [None]:
# now that the dates are not only strings anymore, we can have a look at the timeframe
thefts_df.tatzeit_anfang_datum.min(), thefts_df.tatzeit_ende_datum.max()

In [None]:
# ... or can even do calculations on the date fields
thefts_df.tatzeit_ende_datum.max() - thefts_df.tatzeit_anfang_datum.min()

In [None]:
# confirm the new datatypes
thefts_df[['tatzeit_anfang_datum', 'tatzeit_ende_datum']].info()

Now that we're done cleaning our dataset, the idea is to enrich it by encoding categorical data as so-called "dummy variables".  
Such a variable (aka indicator variable) is a numeric variable that represents categorical data: each category gets its own column, which holds a 0 or 1.  

We'll use this on the `art_des_fahrrads` column ("Art des Fahrrads"), the type of bike.

In [None]:
# A glance at the values of the type of bikes in the dataframe
thefts_df.art_des_fahrrads.unique()

In [None]:
# get_dummies is a function of the pandas module - you simply pass in a pandas Series 
# or DataFrame, and it converts a categorical variable into dummy/indicator variables. 
# The idea of dummy coding is to give each category its own column and assign a 1 or 0 to it.
# This can be an important step during data preparation for machine learning.

# creating a dataset of type of bike dummy variables.
biketype_dummies = pd.get_dummies(thefts_df.art_des_fahrrads, prefix='type')
biketype_dummies.head()

## Combining dataframes

### Join()
Now let's look at the `join()` method. It joins on indices by default and is called on a dataframe instance. This means that we can simply join our bike type dummies dataframe back to our original bike thefts dataframe with the following code:

In [None]:
# Joining columns of another DataFrame using the join() method.
join_df = thefts_df.join(biketype_dummies)
join_df.head()

In [None]:
# Let's have a look at the columns of our newly assigned dataframe
join_df.info()

The arguments of `.join` are the following: 
````
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
````
The documentation refers to the second dataframe as 'other', whereas the documentation of the other combining methods often calls it 'right'.  
With `how` we can specify which join method we want to use.

If we want to join using a common column, we need to set this column to be the index in both dataframes. The joined DataFrame will have the common column as its index.

```
df.set_index('column_name').join(other.set_index('column_name'))
```

Another option to join using a common column is to use the `on` parameter. This method preserves the original DataFrame’s index in the result.
```
df.join(other.set_index('column_name'), on='column_name')
```
See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html).
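
To make the difference between the two variants concrete, here is a small sketch on made-up data. The frames `students` and `marks` and their contents are hypothetical; only `set_index()`, `join()` and the `on` parameter are taken from the text above.

```python
import pandas as pd

# Hypothetical toy frames sharing the column "student_id"
students = pd.DataFrame({"student_id": ["S1", "S2", "S3"],
                         "name": ["Ann", "Ben", "Cem"]})
marks = pd.DataFrame({"student_id": ["S1", "S3"], "marks": [23, 12]})

# Variant 1: make the common column the index of both frames
by_index = students.set_index("student_id").join(marks.set_index("student_id"))

# Variant 2: keep the left frame's index and name the key column via `on`
by_column = students.join(marks.set_index("student_id"), on="student_id")

print(by_index.index.name)    # the common column became the index
print(list(by_column.index))  # the left frame's original index is kept
```

In both variants the student without a matching row in `marks` is kept with a missing value, because `join()` uses `how='left'` by default.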

The `how` argument, which `join()` shares with `merge()`, specifies which keys are included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the `how` options and their SQL equivalent names:

Merge method | SQL Join Name | Description
---|---|---
left| LEFT OUTER JOIN | Use keys from left frame only
right | RIGHT OUTER JOIN | Use keys from right frame only
outer | FULL OUTER JOIN | Use union of keys from both frames
inner | INNER JOIN | Use intersection of keys from both frames
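
The effect of each option in the table can be checked on two small frames. This is only a sketch: `left`, `right` and `keys_by_how` are names made up for this example.

```python
import pandas as pd

# Hypothetical toy frames: "a" exists only on the left, "d" only on the right
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Collect the keys that survive each join method
keys_by_how = {
    how: sorted(left.merge(right, on="key", how=how)["key"])
    for how in ["left", "right", "outer", "inner"]
}

for how, keys in keys_by_how.items():
    print(f"{how:>5}: {keys}")
```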


You can also think of it as set theory and use Venn diagrams to illustrate what happens in each method.

![Join Methods](./images/join_types.png)

### Merge()
Let's look at the `merge()` method. Merge combines dataframes on common columns by default and can be used via the pandas module as well as called on a dataframe instance.

The arguments of `.merge` are the following: 
````
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False,   
suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
````
See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).
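
A few of these arguments are easiest to understand in action. The sketch below uses made-up frames (`areas`, `events`) and a hypothetical key name `area_code` to show `left_on`/`right_on`, `suffixes` and `indicator` together.

```python
import pandas as pd

# Hypothetical toy frames: the key has a different name in each frame,
# and both frames contain a column called "value"
areas = pd.DataFrame({"plr_id": ["001", "002", "003"], "value": [10, 20, 30]})
events = pd.DataFrame({"area_code": ["002", "003", "004"], "value": [5, 6, 7]})

result = areas.merge(
    events,
    how="outer",
    left_on="plr_id",              # key column in the left frame
    right_on="area_code",          # key column in the right frame
    suffixes=("_area", "_event"),  # disambiguate the overlapping "value" columns
    indicator=True,                # adds a "_merge" column with each row's origin
)
print(result[["plr_id", "area_code", "_merge"]])
```

The `_merge` column is a handy check after a merge: rows marked `left_only` or `right_only` found no partner in the other frame.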

In [None]:
# To merge, both dataframes need a common column.
# Let's turn the index into a column and merge on it:
thefts_df_ind = thefts_df.reset_index()
biketype_dummies_ind = biketype_dummies.reset_index()

In [None]:
# check result - you will see a new column called index in the dataframe
thefts_df_ind.head()

In [None]:
# check result - you will see a new column called index in the dataframe
biketype_dummies_ind.head()

In [None]:
# Merge the biketype_dummies df onto the thefts_df instance on the common column 'index'
merge_df1 = thefts_df_ind.merge(biketype_dummies_ind, on='index')
merge_df1.head()

In [None]:
# Or another way: Merge the two dataframes via the pandas module on the common column 'index'
merge_df2 = pd.merge(thefts_df_ind, biketype_dummies_ind, on='index')
merge_df2.head()

### Concat()

Let's now look at concat.
`````
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None,    
verify_integrity=False, sort=False, copy=True)
`````
Unlike `join()` and `merge()`, which combine dataframes side by side by matching keys, `concat()` lets you choose with the `axis` argument whether to append along rows (`axis=0`) or along columns (`axis=1`).

See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.concat.html).
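
Here is a minimal sketch of both directions on made-up frames (`a` and `b` are hypothetical and share only the column `x`):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "z": [7, 8]})

# axis=0: stack rows; columns are aligned by name, gaps become NaN
rows = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1: place the frames side by side; rows are aligned by index label
cols = pd.concat([a, b], axis=1)

# join='inner' keeps only the labels that all frames share on the other axis
rows_inner = pd.concat([a, b], axis=0, join="inner", ignore_index=True)

print(rows.shape, cols.shape, list(rows_inner.columns))
```

Note that `axis=1` does not match on keys the way `merge()` does: rows are paired purely by their index labels.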

In [None]:
concat_df = pd.concat([biketype_dummies, thefts_df], axis=1)
concat_df.head()

The images below show the difference between setting `axis` to 0 and to 1.

**Concat with axis=0:**
![Concat Axis 0](./images/concat_axis_0.png)

---

**Concat with axis=1:**
![Concat Axis 1](./images/concat_axis_1.png)

(The pictures were part of [this](https://towardsdatascience.com/python-pandas-dataframe-join-merge-and-concatenate-84985c29ef78) blog post.)

---

## Check your understanding

Leaving bike thefts aside,  
1. Please combine the two given dataframes (df1 and df2) along rows and merge the result with the third dataframe (df3) on the common column `student_id`.  
If any key combinations are not present, these should be filled with NaNs.


In [None]:
df1 = pd.DataFrame({
        'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
         'name': ['Erika Raaf', 'Nadja Berens', 'Florentin Kleist', 'Dorothea Eibl', 'Gerhard Bihlmeier'], 
        'subject': ['Math', 'Biology', 'Biology', 'English', 'Philosophy']})
df2 = pd.DataFrame({
        'student_id': ['S6', 'S7', 'S8'],
        'name': ['Jens Hüls', 'Vera Kagan', 'Paula Brodersen'], 
        'subject': ['Math', 'Math', 'Social Science']})
df3 = pd.DataFrame({
        'student_id': ['S1', 'S2', 'S3', 'S4', 'S5', 'S7', 'S8', 'S9', 'S10', 'S11', 'S12', 'S13'],
        'marks': [23, 45, 12, 67, 21, 55, 33, 14, 56, 83, 88, 12]})

2. You have received some weather data (temperature) for the last year. The average temperature was measured for every month, but the maximum temperature could only be measured for a few months. Combine these two datasets without losing any information.

In [None]:
weather_mean_data = {'Mean TemperatureF': [53.1, 70., 34.93548387, 28.71428571, 32.35483871, 72.87096774, 70.13333333, 35., 62.61290323, 39.8, 55.4516129 , 63.76666667],
                     'Month': ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']}
weather_max_data = {'Max TemperatureF': [68, 89, 91, 84], 'Month': ['Jan', 'Apr', 'Jul', 'Oct']}


3. (Extra question: Can you fill the missing values in the column `Max TemperatureF` with the average maximum temperature?)

---

## Combining multiple data sources

Remember we initially said, we wanted to be able to identify areas in Berlin with the most bike thefts?  
So far, we can't.  

We have a lot of features describing the actual bike thefts, but nothing that really pinpoints the area where they happen. The only thing we have in our dataframe is this suspicious "LOR" - so we have to do some research to find out if and how we can use it ...  

The [dataset description](https://www.berlin.de/polizei/_assets/dienststellen/lka/datensatzbeschreibung.pdf) at Berlin Open Data tells us about the LOR column (translated from German):
- Identifier of the planning area, 8 digits
- Spatial hierarchy "Raumhierarchie lebensweltlich orientierte Räume (LOR)" of the Senatsverwaltung für Stadtentwicklung und Wohnen (the Senate department for urban development and housing)

Wow. _Raumhierarchie lebensweltlich orientierte Räume_ - that's when you know you are dealing with authorities. 
Since we have no idea what that means, we google it and find that the website of [stadtentwicklung.berlin.de](https://www.stadtentwicklung.berlin.de/planen/basisdaten_stadtentwicklung/lor/de/download.shtml) offers vector data files associated with the LOR, so-called .shp "shapefiles". So we have a look at them, too ...


We now access the shapefiles and try to combine them with our biketheft data.

In [None]:
# assign a geodataframe based on the shapefile
gdf = gpd.GeoDataFrame.from_file('data/LOR_SHP_2021/lor_plr.shp')
gdf.columns = gdf.columns.str.lower()
gdf.head(5)

In [None]:
gdf.info()

As we can see, this gave us a dataframe that apparently contains the LOR as plr_id, the district name and the geometrical shape of the area as a polygon.

##### Polygon? What was that again?

<img src="images/geometries.jpg" alt="geometries" width="500"/>

So, those polygons should give us areas of Berlin. Let's give it a try: 

In [None]:
# plotting the geometries
gdf.plot(color='grey', figsize=(12, 12));

That looks a lot like Berlin, which makes us quite confident to go ahead and try to merge the sets. Our bike theft data is not yet inside our geodataframe (or vice versa) - those are still two separate datasets.  

So - we need to have a look at the columns that allow us to merge ...

In [None]:
# bike thefts lor column
thefts_df.lor.info()

In [None]:
# geodataframe lor column
gdf.plr_id.info()

Not that easy, again.  
- The column 'lor' in the bike theft data is an integer.  
- Integers as numeric values can't have leading zeros.  
- That's why it is sometimes 8 digits and sometimes just 7 digits long - in the latter case a leading 0 is missing, so we need to pad it!  

In the geodataframe, the lor column is an object, which means a string in this case.  
Feel free to have a closer look ...
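
The following sketch reproduces the problem on a tiny, made-up CSV (the text in `csv_text` and the column `amount` are hypothetical). It also shows one possible alternative, reading the identifier as text right away with the `dtype` argument of `pd.read_csv`, which is an assumption about what might fit here and not what this notebook does.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "LOR,amount\n01011101,100\n12345678,200\n"

# Default parsing infers an integer column and drops the leading zero
as_int = pd.read_csv(io.StringIO(csv_text))

# Reading the identifier as text keeps all 8 characters
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"LOR": str})

# Repairing an already parsed integer column: cast to str, then left-pad with zeros
repaired = as_int["LOR"].astype(str).str.zfill(8)

print(as_int["LOR"].tolist())
print(repaired.tolist())
```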

In [None]:
# changing the lor column datatype to string
thefts_df['lor_str'] = thefts_df['lor'].astype('str')

# pad with leading zeros up to 8 characters and name the new column like the key column in the geodataframe
thefts_df['plr_id'] = thefts_df['lor_str'].str.zfill(8)

# dropping no longer needed columns
thefts_df.drop(columns=['lor', 'lor_str'], inplace=True)

thefts_df.info()

In [None]:
# compare with the geodataframe
gdf.info()

Now we are able to merge our dataframes:

In [None]:
# merging dataframes on the plr_id columns
gdf_biketheft = gdf.merge(thefts_df, on='plr_id')
gdf_biketheft.info()

And so we are finally able to identify the area with the most stolen bikes  
by aggregating the thefts per area, first as a count and then as an average amount:

aggregating count of thefts

In [None]:
# counting thefts in areas
df_plr_group_thefts = gdf_biketheft.groupby(by='plr_id').size().reset_index(name='thefts')

# showing new dataframe with plr_id and aggregated count of thefts
df_plr_group_thefts.head()

aggregating average theft amounts

In [None]:
# averaging theft amounts per area (select the numeric column first, so that non-numeric columns are not averaged)
df_plr_group_mean = gdf_biketheft.groupby(by='plr_id')['schadenshoehe'].mean().reset_index()
df_plr_group_mean = df_plr_group_mean.astype({'schadenshoehe': 'int64'})
df_plr_group_mean.rename(columns={'schadenshoehe': 'avg_amount'}, inplace=True)

# showing new dataframe with plr_id and aggregated mean of thefts
df_plr_group_mean.head()
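
As a side note, the count and the mean from the two cells above could also be computed in a single pass with named aggregation. This is just a sketch on made-up numbers: the frame `toy` is hypothetical and only reuses the column names `plr_id` and `schadenshoehe` from the notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as in the notebook
toy = pd.DataFrame({
    "plr_id": ["01", "01", "02", "02", "02"],
    "schadenshoehe": [100, 300, 50, 150, 700],
})

# One groupby, two named aggregations: number of rows and mean amount per area
summary = (
    toy.groupby("plr_id")
       .agg(thefts=("schadenshoehe", "size"),
            avg_amount=("schadenshoehe", "mean"))
       .reset_index()
)
print(summary)
```

With this approach a single merge into the geodataframe would be enough, since both aggregates live in one dataframe.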

In [None]:
# merging the aggregates into the initial geodataframe
gdf_biketheft = gdf.merge(df_plr_group_thefts, on='plr_id')
gdf_biketheft = gdf_biketheft.merge(df_plr_group_mean, on='plr_id')
gdf_biketheft.info()

In [None]:
# so, which area is the winner?
gdf_biketheft[gdf_biketheft.thefts == gdf_biketheft.thefts.max()][['plr_name', 'thefts', 'avg_amount']]

And here we have our winner - it is __Alt-Treptow with 501 thefts__ in the observed timeframe with an average theft amount of 791 Euro!  

---

Congratulations!  
You made it through another intense notebook - but we hope the little excursions brought some fun ...