In [1]:
import numpy as np
import matplotlib.pyplot as plt #for displaying plots
import pandas as pd
import seaborn as sns
import tensorflow as tf
import datetime
import random

In [2]:
# Set seeds to ensure reproducibility and consistent results across runs
random.seed(16)
np.random.seed(16)
tf.random.set_seed(16)

In [3]:
plt.rcParams.update({'font.size': 13})

This notebook focuses on data exploration and "enhancement":
- inspecting outliers (without removing or modifying them)
- checking for duplicates
- checking for NaN values
- "Enhancements"
    - adding columns for school holidays, public holidays and holidays
    - adding a column for workdays
    - adding a column with lockdown information
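The checks listed above can be sketched on a small toy frame. This is only a sketch: the values are made up, and the `workday` rule (Monday to Friday, ignoring holidays) is an assumption for illustration, not necessarily the definition used later in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-06", "2023-01-07", "2023-01-07", "2023-01-09"]),
    "bike_count": [1200, 800, 800, None],
})

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Number of NaN values per column
n_missing = toy.isna().sum()

# Hypothetical workday flag: 1 for Monday-Friday, 0 for the weekend.
# Public and school holidays are not taken into account here.
toy["workday"] = (toy["date"].dt.dayofweek < 5).astype(int)

print(n_duplicates)
print(n_missing)
print(toy)
```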

# Loading the Data

#### Data Set Information:

Bicycle data in Karlsruhe, Germany.

#### Attribute Information
- `bike_count`: Number of bikes that passed the counting station during the day

- `temperature`: Temperature in °C

- `humidity`: Relative humidity in %

- `windspeed`: Windspeed in m/s, faulty values = -999

- `wind_direction`: Wind direction in °, faulty values = -999

- `precipitation`: Precipitation in mm (hourly sum)

- `precip_indic`: Precipitation indicator
  - 0 = no
  - 1 = yes
  - -999 = faulty

- `precip_type`:
  - 0 = no precipitation (conventional or automatic measurement)
  - 1 = just rain (in historical data before 01.01.1979)
  - 4 = Precipitation form unknown, although precipitation reported; Form of falling and deposited precipitation cannot be clearly determined with automatic measurement
  - 6 = Only rain; liquid precipitation in automatic measurement
  - 7 = Only snow; solid precipitation in automatic measurement
  - 8 = Rain and snow and/or sleet; liquid and solid precipitation in automatic measurement
  - 9 = Misidentification; missing value or precipitation form not determinable with automatic measurement
  - -999 = Faulty value

- `sun`: Sunshine duration in minutes (hourly sum), faulty values = -999

- `visibility`: Visibility in meters, faulty values = -999

Note: `precip_type` and `precip_indic` were aggregated to daily values using the median.
Visibility: checked, maintained, not corrected.
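The attribute list mentions two preprocessing ideas: `-999` marks a faulty value, and the coded precipitation columns were reduced to one value per day with the median. A minimal sketch of both steps on made-up hourly readings follows. Using the mean for `windspeed` is an assumption made for illustration; the source only states the median for the coded columns.

```python
import numpy as np
import pandas as pd

# Made-up hourly readings; -999 marks a faulty value as in the attribute list
hourly = pd.DataFrame(
    {
        "windspeed": [3.1, -999.0, 2.7, 4.0],
        "precip_type": [0.0, 6.0, 6.0, -999.0],
    },
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00", "2020-01-02 00:00"]
    ),
)

# Treat the sentinel as missing so it cannot distort means or medians
hourly_clean = hourly.replace(-999, np.nan)

# Daily aggregation: mean for the continuous variable (assumption),
# median for the coded precipitation type
daily = hourly_clean.resample("D").agg({"windspeed": "mean", "precip_type": "median"})

print(daily)
```

Replacing the sentinel before aggregating matters: a single `-999` left in an hourly series would pull a daily mean far below any plausible value.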

In [4]:
df_weather = pd.read_csv(r"C:\Users\aisti\OneDrive\Dokumente\Uni\Bachelorarbeit\Daten\2012-04-25_2024-01-25_KA_weather_daily.csv")

In [5]:
df_weather.columns

Index(['date', 'temperature', 'humidity', 'windspeed', 'wind_direction',
       'visibility', 'precipitation', 'sun', 'windspeed_max', 'precip_indic',
       'precip_type'],
      dtype='object')

In [6]:
df_weather.dtypes

date               object
temperature       float64
humidity          float64
windspeed         float64
wind_direction    float64
visibility        float64
precipitation     float64
sun                 int64
windspeed_max     float64
precip_indic      float64
precip_type       float64
dtype: object
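The `date` column is read as `object`, i.e. as plain strings. One possible way to get a proper datetime column is to parse it while reading the file. The sketch below uses a two-row in-memory CSV with made-up values in place of the real file.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real CSV file
csv_text = "date,temperature\n2012-04-25,11.3\n2012-04-26,12.1\n"

# parse_dates converts the column during reading;
# calling pd.to_datetime on the column afterwards would have the same effect
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(toy.dtypes)
```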

In [7]:
df_weather.describe()

| | temperature | humidity | windspeed | wind_direction | visibility | precipitation | sun | windspeed_max | precip_indic | precip_type |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4291.0 | 4291.0 | 4291.0 | 4291.0 | 4291.0 | 4291.0 | 4291.0 | 4291.0 | 4288.0 | 4106.0 |
| mean | 12.538591 | 73.092589 | 3.475722 | 176.76878 | 34281.616807 | 1.539361 | 309.323701 | 9.544232 | 0.191814 | 1.204822 |
| std | 7.636231 | 14.120612 | 1.81669 | 69.225224 | 16381.273131 | 3.911284 | 280.267835 | 3.956637 | 0.384103 | 2.355714 |
| min | -7.827778 | 31.277778 | 0.6 | 16.666667 | 207.777778 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 25% | 6.427778 | 62.611111 | 2.144444 | 129.722222 | 21317.777778 | 0.0 | 33.0 | 7.0 | 0.0 | 0.0 |
| 50% | 12.45 | 74.666667 | 3.122222 | 206.666667 | 35432.222222 | 0.0 | 246.0 | 9.0 | 0.0 | 0.0 |
| 75% | 18.705556 | 84.111111 | 4.355556 | 228.888889 | 46983.611111 | 1.0 | 548.0 | 11.7 | 0.0 | 0.0 |
| max | 31.794444 | 100.0 | 12.233333 | 340.0 | 71937.777778 | 58.3 | 909.0 | 31.0 | 1.0 | 8.0 |


In [8]:
df_weather.isna().sum()

date                0
temperature         0
humidity            0
windspeed           0
wind_direction      0
visibility          0
precipitation       0
sun                 0
windspeed_max       0
precip_indic        3
precip_type       185
dtype: int64
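Only `precip_indic` and `precip_type` contain NaN values. Before deciding how to treat them, it can help to look at the affected rows and at how the gaps are spread over time. The sketch below does this on a made-up three-row frame; the grouping by calendar year is just one possible view.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the weather data above
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
    "precipitation": [0.0, 2.4, 0.0],
    "precip_indic": [0.0, np.nan, 0.0],
    "precip_type": [0.0, np.nan, np.nan],
})

coded_cols = ["precip_indic", "precip_type"]

# Rows where at least one of the two coded columns is missing
mask = toy[coded_cols].isna().any(axis=1)
rows_with_gaps = toy.loc[mask]

# Number of missing values per column and calendar year
gaps_per_year = toy[coded_cols].isna().groupby(toy["date"].dt.year).sum()

print(rows_with_gaps)
print(gaps_per_year)
```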

In [9]:
# Parse the date strings so that datetimes are compared with datetimes, not with strings
dates = pd.to_datetime(df_weather['date'])

# Generate a complete daily range spanning the observed period
complete_range = pd.date_range(start=dates.min(), end=dates.max(), freq='D')

# Find the missing dates by comparing the complete range with the actual dates in the DataFrame
missing_dates = complete_range[~complete_range.isin(dates)]

# Print the missing dates
print("Missing dates:")
print(missing_dates)

Missing dates:
DatetimeIndex(['2018-04-15', '2018-04-16'], dtype='datetime64[ns]', freq='D')
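Once missing days are known, one way to make the gap explicit is to reindex the frame onto the complete daily range, which inserts the absent days as NaN rows. The sketch below uses made-up temperatures around the gap. The interpolation step at the end is optional and shown purely for illustration; it is not necessarily how this notebook treats the gap.

```python
import pandas as pd

# Made-up values around a two-day gap (2018-04-15 and 2018-04-16 are absent)
toy = pd.DataFrame(
    {"temperature": [9.5, 10.1, 12.4]},
    index=pd.to_datetime(["2018-04-13", "2018-04-14", "2018-04-17"]),
)

# Complete daily range between the first and last observed day
full_index = pd.date_range(toy.index.min(), toy.index.max(), freq="D")

# Reindexing inserts the absent days as rows filled with NaN
toy_full = toy.reindex(full_index)

# Optional: fill the short gap by linear interpolation over time
toy_filled = toy_full.interpolate(method="time")

print(toy_full)
print(toy_filled)
```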


