# Notebook

In this data preparation notebook, the datasets for cyclist counts and weather data are loaded, column names are adjusted, and data types are checked and aligned. The results are then exported for use in the notebook [2_EDA](https://github.com/Rudinius/Bike_usage_Bremen/blob/4f39d66836e0585770c37d1cf261b0c0dd95101f/2_EDA.ipynb).

The datasets for geo-locations, vacations and holidays were created by hand from online sources and therefore do not need any further data preparation.

<a name="content"></a>
# Content

* [1. Import libraries](#1)
* [2. Import datasets](#2)
    * [2.1 Cyclists dataset](#2.1)
        * [2.1.1 Description](#2.1.1)
        * [2.1.2 Preparing cyclists dataset](#2.1.2)
    * [2.2 Weather dataset](#2.2)
        * [2.2.1 Description](#2.2.1)
        * [2.2.2 Preparing weather dataset](#2.2.2)
    * [2.3 Public holidays](#2.3)
        * [2.3.1 Description](#2.3.1)
        * [2.3.2 Preparing public holidays dataset](#2.3.2)
    * [2.4 School vacations](#2.4)
        * [2.4.1 Description](#2.4.1)
        * [2.4.2 Preparing school vacations dataset](#2.4.2)
    * [2.5 Geolocations](#2.5)
        * [2.5.1 Description](#2.5.1)
        * [2.5.2 Preparing geolocations dataset](#2.5.2)
* [3. Export datasets](#3.0)

<a name="1"></a>
# 1.&nbsp;Import libraries
[Content](#content)

In [58]:
# Import libraries
import datetime
import numpy as np
import pandas as pd
from google.colab import files


In [59]:
# Install package pyjanitor since it is not part of the standard packages
# of Google Colab

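import importlib.util

# Check if package is installed.
# pyjanitor is installed under the name "pyjanitor" but imported as "janitor",
# so the check has to look for the module name.
package_name = "pyjanitor"
spec = importlib.util.find_spec("janitor")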
if spec is None:
    # Package is not installed, install it via pip
    !pip install pyjanitor
else:
    print(f"{package_name} is already installed")

import janitor



<a name="2"></a>
# 2.&nbsp;Import datasets

[Content](#content)

Next, the raw datasets for the number of cyclists and the weather are imported from the project's data folder.

* `raw_cyclists_2013-2022.csv`: daily counts from different measuring points in Bremen, from 01.01.2013 to 31.12.2022
* `raw_weather_2013-2020.csv`: daily weather values such as min and max temperature, rainfall, ... from 01.01.2013 to 31.12.2020
* `raw_weather_2021-2022.csv`: daily weather values such as min and max temperature, rainfall, ... from 01.01.2021 to 31.12.2022

The cyclist counts for the different counting stations have been taken from [VMZ Bremen](https://vmz.bremen.de/rad/radzaehlstationen-abfrage). Each column is named after a different counting station.

The weather data has been imported from [Meteostat](https://meteostat.net/).

In [60]:
# Set base url
url = "https://raw.githubusercontent.com/Rudinius/Bike_usage_Bremen/main/data/"


<a name="2.1"></a>
## 2.1 Cyclists dataset

[Content](#content)

<a name="2.1.1"></a>
### 2.1.1 Description

[Content](#content)

The cyclist dataset contains the daily number of cyclists for different counting stations. Each column is named after a counting station and contains the number of cyclists counted there on a given day.

<a name="2.1.2"></a>
### 2.1.2 Preparing cyclists dataset

[Content](#content)

First, we shorten the column names and remove spaces and German umlauts (the `ß` is kept, as the column names in the output below show). For this we use the `pyjanitor` package.

We also rename the index column from `Zeitpunkt ('Y-m-d H:i:s')` to `date`.

Lastly, the rows of the cyclist dataset are not in chronological order. Therefore we sort the index before exporting the new dataset.

In [61]:
# Import datasets

# The original csv file uses ';' as a separator. We will also parse the date column as datetime64
df_cyclist = pd.read_csv(url + "raw_cyclists_2013-2022.csv", sep= ";",
                         parse_dates=[0], index_col=[0]).clean_names(strip_underscores="both").sort_index()

# Apply new name to index
df_cyclist.index.names = ['date']


In [62]:
df_cyclist.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3652 entries, 2013-01-01 to 2022-12-31
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   graf_moltke_straße_ostseite   3622 non-null   float64
 1   graf_moltke_straße_westseite  3576 non-null   float64
 2   hastedter_bruckenstraße       3636 non-null   float64
 3   langemarckstraße_ostseite     3639 non-null   float64
 4   langemarckstraße_westseite    3651 non-null   float64
 5   osterdeich                    3651 non-null   float64
 6   radweg_kleine_weser           3550 non-null   float64
 7   schwachhauser_ring            3652 non-null   int64  
 8   wachmannstraße_auswarts_sud   3561 non-null   float64
 9   wachmannstraße_einwarts_nord  3474 non-null   float64
 10  wilhelm_kaisen_brucke_ost     3652 non-null   int64  
 11  wilhelm_kaisen_brucke_west    3606 non-null   float64
dtypes: float64(10), int64(2)
memory usage: 370.9


In [63]:
df_cyclist.head()


| date | graf_moltke_straße_ostseite | graf_moltke_straße_westseite | hastedter_bruckenstraße | langemarckstraße_ostseite | langemarckstraße_westseite | osterdeich | radweg_kleine_weser | schwachhauser_ring | wachmannstraße_auswarts_sud | wachmannstraße_einwarts_nord | wilhelm_kaisen_brucke_ost | wilhelm_kaisen_brucke_west |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013-01-01 | 261.0 | 290.0 | 381.0 | 312.0 | 308.0 | 870.0 | 410.0 | 391 | 514.0 | 267.0 | 1228 | 563.0 |
| 2013-01-02 | 750.0 | 876.0 | 1109.0 | 1258.0 | 1120.0 | 2169.0 | 1762.0 | 829 | 1786.0 | 1456.0 | 4024 | 2355.0 |
| 2013-01-03 | 931.0 | 1015.0 | 1603.0 | 1556.0 | 1480.0 | 2295.0 | 2287.0 | 1196 | 2412.0 | 2035.0 | 5013 | 3028.0 |
| 2013-01-04 | 500.0 | 587.0 | 1284.0 | 703.0 | 626.0 | 1640.0 | 1548.0 | 1418 | 964.0 | 702.0 | 2382 | 1121.0 |
| 2013-01-05 | 1013.0 | 1011.0 | 0.0 | 1856.0 | 1621.0 | 4128.0 | 4256.0 | 3075 | 2065.0 | 1377.0 | 5736 | 3221.0 |
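
The `info()` output above shows `float64` for every station with missing values and `int64` for the two complete stations. The notebook keeps these mixed dtypes. If a uniform type were wanted, one option is pandas' nullable `Int64` dtype. The following is only a sketch on a hypothetical miniature frame (`df_demo` and its column names are made up for illustration), and it assumes that all counts are whole numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cyclists data: one station with a gap,
# one station without gaps.
df_demo = pd.DataFrame(
    {
        "station_with_gap": [261.0, np.nan, 931.0],
        "station_complete": [391, 829, 1196],
    },
    index=pd.DatetimeIndex(["2013-01-01", "2013-01-02", "2013-01-03"], name="date"),
)

# A missing value forces float64, a complete column stays int64.
print(df_demo.dtypes)

# Count the missing values per column.
missing_per_column = df_demo.isna().sum()

# Assumption: a single nullable integer dtype is acceptable for all stations.
# Missing values become <NA> instead of NaN.
df_aligned = df_demo.astype("Int64")
print(df_aligned.dtypes)
```

Whether this conversion is worthwhile depends on the later analysis, since some libraries still expect plain NumPy dtypes.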


<a name="2.2"></a>
## 2.2 Weather dataset

[Content](#content)

<a name="2.2.1"></a>
### 2.2.1 Description

[Content](#content)

Apart from the date, the weather dataset contains the daily weather values in the following columns:


`tavg`: Avg. Temperature in °C <br>
`tmin`: Min. Temperature in °C <br>
`tmax`: Max. Temperature in °C <br>
`prcp`: Total Precipitation in mm <br>
`snow`: Snow Depth in mm <br>
`wdir`: Wind Direction in ° <br>
`wspd`: Wind Speed in km/h <br>
`wpgt`: Wind Peak Gust in km/h <br>
`pres`: Air Pressure in hPa <br>
`tsun`: Sunshine Duration in minutes <br>

The weather dataset is split into two files, one ranging from 2013 to 2020 and one from 2021 to 2022. Here we will concatenate both files and export them as one.
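
When two yearly files are stacked, it can be worth checking that no date occurs twice and that no day is missing at the seam. The sketch below shows one possible check on two hypothetical stand-in frames (`df_a`, `df_b` and their values are invented for illustration and are not the real files):

```python
import pandas as pd

# Hypothetical stand-ins for the two raw weather files.
df_a = pd.DataFrame(
    {"tavg": [1.0, 2.0]},
    index=pd.DatetimeIndex(["2020-12-30", "2020-12-31"], name="date"),
)
df_b = pd.DataFrame(
    {"tavg": [3.0, 4.0]},
    index=pd.DatetimeIndex(["2021-01-01", "2021-01-02"], name="date"),
)

# verify_integrity=True raises a ValueError if a date occurs in both parts.
df_joined = pd.concat([df_a, df_b], axis=0, verify_integrity=True)

# Compare against a complete daily range to detect missing days.
expected_days = pd.date_range(df_joined.index.min(), df_joined.index.max(), freq="D")
missing_days = expected_days.difference(df_joined.index)
print(f"missing days: {len(missing_days)}")
```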

<a name="2.2.2"></a>
### 2.2.2 Preparing weather dataset

[Content](#content)

In [64]:
# Import datasets

# The weather data is split across two files. Both files are read separately and
# then concatenated
df_weather_a = pd.read_csv(url + "raw_weather_2013-2020.csv", parse_dates=[0], index_col=[0])
df_weather_b = pd.read_csv(url + "raw_weather_2021-2022.csv", parse_dates=[0], index_col=[0])
df_weather = pd.concat([df_weather_a, df_weather_b], axis=0)


In [65]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3652 entries, 2013-01-01 to 2022-12-31
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   tavg    3652 non-null   float64
 1   tmin    3652 non-null   float64
 2   tmax    3652 non-null   float64
 3   prcp    3652 non-null   float64
 4   snow    3439 non-null   float64
 5   wdir    3642 non-null   float64
 6   wspd    3652 non-null   float64
 7   wpgt    3651 non-null   float64
 8   pres    3652 non-null   float64
 9   tsun    3652 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 313.8 KB



In [66]:
df_weather.head()


| date | tavg | tmin | tmax | prcp | snow | wdir | wspd | wpgt | pres | tsun |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013-01-01 | 6.9 | 3.5 | 9.1 | 6.9 | 0.0 | 233.0 | 19.4 | 50.4 | 1001.8 | 0 |
| 2013-01-02 | 5.6 | 4.2 | 7.1 | 1.8 | 0.0 | 246.0 | 20.2 | 40.0 | 1017.5 | 30 |
| 2013-01-03 | 8.6 | 6.0 | 10.6 | 0.9 | 0.0 | 257.0 | 23.8 | 45.7 | 1024.5 | 0 |
| 2013-01-04 | 8.8 | 6.8 | 9.7 | 0.0 | 0.0 | 276.0 | 25.2 | 48.2 | 1029.5 | 0 |
| 2013-01-05 | 7.7 | 6.5 | 8.6 | 0.1 | 0.0 | 293.0 | 20.2 | 41.0 | 1029.9 | 0 |


<a name="2.3"></a>
## 2.3 Public holidays

[Content](#content)

<a name="2.3.1"></a>
### 2.3.1 Description

[Content](#content)

The dataset `Holidays` contains all the public holidays between 2013 and 2022.

Each row is one holiday date, and the single column `holiday` holds the name of that holiday.

<a name="2.3.2"></a>
### 2.3.2 Preparing public holidays dataset

[Content](#content)

In [67]:
df_holidays = pd.read_csv(url + "raw_holidays_2013-2022.csv", parse_dates=[0], index_col=[0], sep=";")


In [68]:
df_holidays.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 96 entries, 2013-01-01 to 2022-12-26
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   holiday  96 non-null     object
dtypes: object(1)
memory usage: 1.5+ KB



In [69]:
df_holidays.head()


| date | holiday |
| --- | --- |
| 2013-01-01 | Neujahr |
| 2013-03-29 | Karfreitag |
| 2013-04-01 | Ostermontag |
| 2013-05-01 | Tag der Arbeit |
| 2013-05-09 | Christi Himmelfahrt |


<a name="2.4"></a>
## 2.4 School vacations

[Content](#content)

<a name="2.4.1"></a>
### 2.4.1 Description

[Content](#content)

The dataset `School Vacation` contains all the school vacation days in the state of Bremen between 2013 and 2022.

<a name="2.4.2"></a>
### 2.4.2 Preparing school vacations dataset

[Content](#content)

In [70]:
df_vacation = pd.read_csv(url + "raw_vacation_2013-2022.csv", parse_dates=[0], index_col=[0], sep=";")
df_vacation


| date | vacation |
| --- | --- |
| 2013-01-01 | Weihnachtsferien |
| 2013-01-02 | Weihnachtsferien |
| 2013-01-03 | Weihnachtsferien |
| 2013-01-04 | Weihnachtsferien |
| 2013-01-05 | Weihnachtsferien |
| ... | ... |
| 2022-12-27 | Weihnachtsferien |
| 2022-12-28 | Weihnachtsferien |
| 2022-12-29 | Weihnachtsferien |
| 2022-12-30 | Weihnachtsferien |


<a name="2.5"></a>
## 2.5 Geolocations of stations

[Content](#content)

<a name="2.5.1"></a>
### 2.5.1 Description

[Content](#content)

There is no dataset to import for the geo-locations, and the data is not available in the cyclist dataset. Therefore we added the locations manually from the [VMZ website](https://vmz.bremen.de/radzaehlstationen/).

The geolocations dataset contains the latitude and longitude of the counting stations, with one row per counting station and `latitude` and `longitude` as columns:

<a name="2.5.2"></a>
### 2.5.2 Preparing geolocations dataset

[Content](#content)

In [71]:
geolocations =  {"graf_moltke_straße_ostseite": (53.0778, 8.8330),
                 "graf_moltke_straße_westseite": (53.0781, 8.8328),
                 "hastedter_bruckenstraße": (53.0612, 8.8528),
                 "langemarckstraße_ostseite": (53.0764, 8.7974),
                 "langemarckstraße_westseite": (53.0765, 8.7969),
                 "osterdeich": (53.0693, 8.8198),
                 "radweg_kleine_weser": (53.0660, 8.8073),
                 "schwachhauser_ring": (53.0891, 8.8409),
                 "wachmannstraße_auswarts_sud": (53.0845, 8.8263),
                 "wachmannstraße_einwarts_nord": (53.0847, 8.8264),
                 "wilhelm_kaisen_brucke_ost": (53.0722, 8.8040),
                 "wilhelm_kaisen_brucke_west": (53.0726, 8.8040)
                }


In [72]:
df_geolocations = pd.DataFrame.from_dict(geolocations, orient="index", columns=["latitude", "longitude"])
df_geolocations.index.names = ["name"]
df_geolocations.head()


| name | latitude | longitude |
| --- | --- | --- |
| graf_moltke_straße_ostseite | 53.0778 | 8.833 |
| graf_moltke_straße_westseite | 53.0781 | 8.8328 |
| hastedter_bruckenstraße | 53.0612 | 8.8528 |
| langemarckstraße_ostseite | 53.0764 | 8.7974 |
| langemarckstraße_westseite | 53.0765 | 8.7969 |
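
Because the station names were typed by hand, a spelling mismatch with the cleaned column names of the cyclist dataset would go unnoticed until the two are joined. A possible sanity check is sketched below on a two-station excerpt. The names `station_columns` and `geo_demo` are hypothetical stand-ins for `df_cyclist.columns` and `geolocations`:

```python
import pandas as pd

# Hypothetical excerpt of the cleaned station columns of the cyclist dataset.
station_columns = pd.Index(["osterdeich", "schwachhauser_ring"])

# Excerpt of the manually entered coordinates (latitude, longitude).
geo_demo = {
    "osterdeich": (53.0693, 8.8198),
    "schwachhauser_ring": (53.0891, 8.8409),
}

df_geo_demo = pd.DataFrame.from_dict(
    geo_demo, orient="index", columns=["latitude", "longitude"]
)
df_geo_demo.index.name = "name"

# Stations that have counts but no coordinates, and the other way round.
missing_coordinates = station_columns.difference(df_geo_demo.index)
unknown_stations = df_geo_demo.index.difference(station_columns)

print(list(missing_coordinates), list(unknown_stations))
```

Both lists being empty indicates that every station has exactly one coordinate pair.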


**Export new dataset**

We will also export this newly created dataset for later use.

<a name="3.0"></a>
# 3.&nbsp;Export datasets

[Content](#content)

To simplify the further analysis, we combine the cycling, weather, holidays and vacation datasets into one dataset.
Because the geolocation dataset has a different format (it is not time series data) and is not of primary interest at the moment, we keep it separate for now.

In [73]:
df_full = pd.concat([df_cyclist, df_weather, df_vacation, df_holidays], axis=1)
df_full.head()


| date | graf_moltke_straße_ostseite | graf_moltke_straße_westseite | hastedter_bruckenstraße | langemarckstraße_ostseite | langemarckstraße_westseite | osterdeich | radweg_kleine_weser | schwachhauser_ring | wachmannstraße_auswarts_sud | wachmannstraße_einwarts_nord | ... | tmax | prcp | snow | wdir | wspd | wpgt | pres | tsun | vacation | holiday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013-01-01 | 261.0 | 290.0 | 381.0 | 312.0 | 308.0 | 870.0 | 410.0 | 391 | 514.0 | 267.0 | ... | 9.1 | 6.9 | 0.0 | 233.0 | 19.4 | 50.4 | 1001.8 | 0 | Weihnachtsferien | Neujahr |
| 2013-01-02 | 750.0 | 876.0 | 1109.0 | 1258.0 | 1120.0 | 2169.0 | 1762.0 | 829 | 1786.0 | 1456.0 | ... | 7.1 | 1.8 | 0.0 | 246.0 | 20.2 | 40.0 | 1017.5 | 30 | Weihnachtsferien |  |
| 2013-01-03 | 931.0 | 1015.0 | 1603.0 | 1556.0 | 1480.0 | 2295.0 | 2287.0 | 1196 | 2412.0 | 2035.0 | ... | 10.6 | 0.9 | 0.0 | 257.0 | 23.8 | 45.7 | 1024.5 | 0 | Weihnachtsferien |  |
| 2013-01-04 | 500.0 | 587.0 | 1284.0 | 703.0 | 626.0 | 1640.0 | 1548.0 | 1418 | 964.0 | 702.0 | ... | 9.7 | 0.0 | 0.0 | 276.0 | 25.2 | 48.2 | 1029.5 | 0 | Weihnachtsferien |  |
| 2013-01-05 | 1013.0 | 1011.0 | 0.0 | 1856.0 | 1621.0 | 4128.0 | 4256.0 | 3075 | 2065.0 | 1377.0 | ... | 8.6 | 0.1 | 0.0 | 293.0 | 20.2 | 41.0 | 1029.9 | 0 | Weihnachtsferien |  |
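
The empty `holiday` entries in the rows above follow from how `pd.concat(..., axis=1)` works: the frames are aligned on their date index, and a date that is absent from one frame gets a missing value in that frame's columns. A minimal sketch on hypothetical miniature frames (`df_counts` and `df_holiday_demo` are invented for illustration):

```python
import pandas as pd

# Three consecutive days of counts for one station.
dates = pd.date_range("2013-01-01", periods=3, freq="D", name="date")
df_counts = pd.DataFrame({"osterdeich": [870.0, 2169.0, 2295.0]}, index=dates)

# The holiday frame only has a row for the one day that is a holiday.
df_holiday_demo = pd.DataFrame(
    {"holiday": ["Neujahr"]},
    index=pd.DatetimeIndex(["2013-01-01"], name="date"),
)

# axis=1 places the columns side by side and aligns the rows on the index
# (outer join by default), so non-holiday dates get a missing value.
df_combined = pd.concat([df_counts, df_holiday_demo], axis=1)
holiday_missing = df_combined["holiday"].isna().sum()

print(df_combined)
```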


In [74]:
date = datetime.date.today()

# Save the new dataset to csv and download the file
file_name = f"{date}" + "_processed_" + "full.csv"
df_full.to_csv(file_name)
files.download(file_name)

# Save the geolocations dataset to csv and download the file
file_name = f"{date}" + "_processed_" + "geolocations.csv"

df_geolocations.to_csv(file_name)
files.download(file_name)
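
The export above relies on `google.colab`, which is only available inside Colab. The round trip itself can be sketched without it. The example below writes a hypothetical miniature frame (`df_demo` is invented for illustration) to an in-memory buffer instead of a file and reads it back the way a later notebook might, assuming the index column is called `date`:

```python
import datetime
import io

import pandas as pd

# Hypothetical miniature of the combined dataset.
df_demo = pd.DataFrame(
    {"osterdeich": [870.0, 2169.0], "holiday": ["Neujahr", None]},
    index=pd.DatetimeIndex(["2013-01-01", "2013-01-02"], name="date"),
)

# Same naming scheme as in the cell above, written as a single f-string.
file_name = f"{datetime.date.today()}_processed_full.csv"

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer)
buffer.seek(0)

# Reading back: parse the date column again and restore it as the index.
df_restored = pd.read_csv(buffer, parse_dates=["date"], index_col="date")
print(df_restored.dtypes)
```

Without `parse_dates`, the index would come back as plain strings rather than as a `DatetimeIndex`.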
