# Exploring Our Dataset

This notebook provides a detailed exploration of our dataset.

## 1. Install dependencies

In [1]:
%pip install pandas
%pip install SQLAlchemy
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


## 2. Load data from datasets

In [8]:
import pandas as pd

df1 = pd.read_sql_table('dataset1', 'sqlite:////Users/miro/PycharmProjects/made-template/data/data/dataset1.sqlite')
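
The absolute path above only resolves on the author's machine. As a minimal sketch of the same load step, the block below writes a tiny table to a throwaway SQLite file and reads it back. It assumes the standard-library `sqlite3` driver with `pd.read_sql_query` in place of SQLAlchemy and `read_sql_table`. The table and column names follow the notebook, the two rows are copied from the preview further down, and `demo_path` is a hypothetical location used only for illustration.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical stand-in for the real dataset1.sqlite file.
demo_path = os.path.join(tempfile.mkdtemp(), "dataset1.sqlite")

# Two rows copied from the notebook's preview of dataset1.
sample = pd.DataFrame({
    "datum": ["2011-01-01", "2011-01-02"],
    "richtung_1": [115.0, 136.0],
    "richtung_2": [91.0, 109.0],
    "gesamt": [206.0, 245.0],
})

# Write the sample table, then read it back with the date column parsed.
conn = sqlite3.connect(demo_path)
sample.to_sql("dataset1", conn, index=False)
loaded = pd.read_sql_query("SELECT * FROM dataset1", conn, parse_dates=["datum"])
conn.close()

print(loaded.dtypes)
```

Parsing `datum` at load time means the later date-based grouping does not need a separate conversion step.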


## 3. Dataset 1: Bikers at the Fahrrad-Dauerzählstellen München (permanent bicycle counting stations in Munich)

The table has four columns: `datum` (date), `richtung_1` (direction 1), `richtung_2` (direction 2), and `gesamt` (total).
Each row holds the bike counts for one date: the count in each direction and the total of both.

In [9]:
df1.head(8)

Unnamed: 0,datum,richtung_1,richtung_2,gesamt
0,2011-01-01,115.0,91.0,206.0
1,2011-01-02,136.0,109.0,245.0
2,2011-01-03,52.0,29.0,81.0
3,2011-01-04,23.0,14.0,37.0
4,2011-01-05,33.0,18.0,51.0
5,2011-01-06,57.0,25.0,82.0
6,2011-01-07,135.0,126.0,261.0
7,2011-01-08,232.0,238.0,470.0
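
The preview suggests that `gesamt` is the sum of the two directions (for example, 115 + 91 = 206). A minimal sketch of how that could be checked, using only the first four rows copied from the output above; on the full data the same expression would run against `df1`, where missing values or rounding might need extra handling.

```python
import pandas as pd

# First four rows copied from the df1.head(8) output.
sample = pd.DataFrame({
    "richtung_1": [115.0, 136.0, 52.0, 23.0],
    "richtung_2": [91.0, 109.0, 29.0, 14.0],
    "gesamt": [206.0, 245.0, 81.0, 37.0],
})

# True where the stored total equals the sum of both directions.
matches = (sample["richtung_1"] + sample["richtung_2"]).eq(sample["gesamt"])
print(matches.all())
```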


## 4. Data exploration

This section examines the structure of the data, its summary statistics, the correlations between columns, and monthly patterns.

### 4.1. Dataset 1: Fahrrad-Dauerzählstellen München

Show the column labels of the DataFrame. The `dtype='object'` in the output describes the labels themselves, not the data stored in the columns.

In [4]:
df1.columns

Index(['datum', 'richtung_1', 'richtung_2', 'gesamt'], dtype='object')

In [19]:
# Rename the German column names to English in a single call
df1.rename(
    columns={
        'datum': 'Date',
        'richtung_1': 'Direction 1',
        'richtung_2': 'Direction 2',
        'gesamt': 'Total',
    },
    inplace=True,
)

In [20]:
df1

Unnamed: 0,Date,Direction 1,Direction 2,Total
0,2011-01-01,115.0,91.0,206.0
1,2011-01-02,136.0,109.0,245.0
2,2011-01-03,52.0,29.0,81.0
3,2011-01-04,23.0,14.0,37.0
4,2011-01-05,33.0,18.0,51.0
...,...,...,...,...
360,2011-12-27,1581.0,1287.0,2868.0
361,2011-12-28,1761.0,1364.0,3125.0
362,2011-12-29,1442.0,1170.0,2612.0
363,2011-12-30,1138.0,915.0,2053.0


`describe()` summarizes each column: count, mean, minimum, quartiles, maximum, and standard deviation. The output below was produced before the columns were renamed, so it still shows the German column names.

In [13]:
df1.describe()

Unnamed: 0,datum,richtung_1,richtung_2,gesamt
count,365,365.0,365.0,365.0
mean,2011-07-02 00:00:00,3096.652055,2457.484932,5554.136986
min,2011-01-01 00:00:00,23.0,14.0,37.0
25%,2011-04-02 00:00:00,1308.0,856.0,2161.0
50%,2011-07-02 00:00:00,2622.0,1798.0,4402.0
75%,2011-10-01 00:00:00,4389.0,3665.0,7936.0
max,2011-12-31 00:00:00,10380.0,8950.0,19330.0
std,,2238.026162,1996.83974,4224.029892


Show the total number of bikers counted over the whole year

In [16]:
# 'gesamt' was renamed to 'Total' in the rename cell above
df1['Total'].sum()

2027260.0
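
As a quick sanity check, the yearly sum should equal the mean multiplied by the count reported by `describe()`. A small sketch using the printed values, so the result only agrees up to the rounding of the printed mean:

```python
# Values copied from the describe() output for the total column.
mean_total = 5554.136986
n_days = 365

# mean * count reconstructs the sum, up to rounding in the printed mean.
reconstructed = mean_total * n_days
print(round(reconstructed))  # 2027260
```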

Pairwise Pearson correlation between all columns

In [21]:
correlation_matrix = df1.corr()
print(correlation_matrix)

                 Date  Direction 1  Direction 2     Total
Date         1.000000     0.512314     0.551268  0.532043
Direction 1  0.512314     1.000000     0.989745  0.997717
Direction 2  0.551268     0.989745     1.000000  0.997132
Total        0.532043     0.997717     0.997132  1.000000
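
The correlations close to 1 between each direction and Total are expected, because Total is the sum of the two directions. A minimal sketch on the first eight rows of the preview, restricted to the numeric columns; the coefficients on this small sample will differ from the full-year values above.

```python
import pandas as pd

# First eight rows copied from the df1.head(8) preview.
sample = pd.DataFrame({
    "Direction 1": [115.0, 136.0, 52.0, 23.0, 33.0, 57.0, 135.0, 232.0],
    "Direction 2": [91.0, 109.0, 29.0, 14.0, 18.0, 25.0, 126.0, 238.0],
})
sample["Total"] = sample["Direction 1"] + sample["Direction 2"]

# Pearson correlation is the default method of DataFrame.corr().
corr = sample.corr()
print(corr.round(3))
```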


Find the month with the highest total number of bikers

In [None]:
# Group by calendar month and sum the 'Total' column only
dg = df1.groupby(pd.Grouper(key='Date', freq='M'))['Total'].sum()

# Find the month with the highest total bikers
max_month = dg.idxmax()

print(f"The month with the highest total bikers is: {max_month.strftime('%B')}")
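
The cell above has not been executed, so no output is shown. A minimal sketch of the same idea on hypothetical counts (the four rows are made up for illustration), grouping with `dt.to_period("M")` as the later cells of this notebook do:

```python
import pandas as pd

# Hypothetical counts, for illustration only (not taken from the dataset).
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2011-01-10", "2011-01-20", "2011-02-05", "2011-02-15"]
    ),
    "Total": [100.0, 150.0, 300.0, 50.0],
})

# Sum 'Total' per calendar month; the index becomes a monthly Period.
monthly_total = demo.groupby(demo["Date"].dt.to_period("M"))["Total"].sum()

# idxmax returns the Period of the month with the largest sum.
busiest = monthly_total.idxmax()
print(busiest.strftime("%B"))  # February
```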


#### 4.1 Correlation between the total number of bikers and the difference between bikers inward and outward

The correlation coefficient of 0.62 falls at the lower end of the strong positive band in the table below.

The table below gives a rule of thumb for interpreting correlation coefficients:

+-------------------------+----------------------------------+
| Correlation coefficient | Interpretation                   |
+-------------------------+----------------------------------+
| 1.0                     | Perfect positive correlation     |
| 0.8 to 0.99             | Very strong positive correlation |
| 0.6 to 0.79             | Strong positive correlation      |
| 0.4 to 0.59             | Moderate positive correlation    |
| 0.2 to 0.39             | Weak positive correlation        |
| -0.19 to 0.19           | No correlation                   |
| -0.2 to -0.39           | Weak negative correlation        |
| -0.4 to -0.59           | Moderate negative correlation    |
| -0.6 to -0.79           | Strong negative correlation      |
| -0.8 to -0.99           | Very strong negative correlation |
| -1.0                    | Perfect negative correlation     |
+-------------------------+----------------------------------+
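
The bands in the table can be sketched as a small lookup function. `interpret_correlation` is a hypothetical helper, not part of pandas. It assumes that coefficients with a magnitude below 0.2 count as "No correlation" regardless of sign, and that values falling between two printed bands (such as 0.795) belong to the lower band.

```python
def interpret_correlation(r: float) -> str:
    """Return the rule-of-thumb label for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    size = abs(r)
    if size < 0.2:
        # Assumption: small coefficients of either sign count as no correlation.
        return "No correlation"

    if size == 1.0:
        strength = "Perfect"
    elif size >= 0.8:
        strength = "Very strong"
    elif size >= 0.6:
        strength = "Strong"
    elif size >= 0.4:
        strength = "Moderate"
    else:
        strength = "Weak"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(interpret_correlation(0.62))  # Strong positive correlation
```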

In my case, a correlation coefficient of 0.62 suggests a strong positive correlation, by the table above, between the 'Total' column and the difference between the 'Direction 1' and 'Direction 2' columns. On days with more bikers overall, the surplus of Direction 1 over Direction 2 tends to be larger as well, but the relationship is far from perfect.

In [22]:
correlation = df1['Total'].corr(df1['Direction 1'] - df1['Direction 2'])
print(f"The correlation coefficient is {correlation:.2f}.")

The correlation coefficient is 0.62.
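
For reference, `Series.corr` computes the Pearson coefficient by default: the sum of products of the centered values, divided by the square root of the product of their sums of squares. A minimal sketch that writes this out with NumPy on the first eight rows of the preview and compares it with the pandas result; on this small sample the value will not be 0.62.

```python
import numpy as np
import pandas as pd

# First eight rows copied from the df1.head(8) preview.
d1 = pd.Series([115.0, 136.0, 52.0, 23.0, 33.0, 57.0, 135.0, 232.0])
d2 = pd.Series([91.0, 109.0, 29.0, 14.0, 18.0, 25.0, 126.0, 238.0])
total = d1 + d2
diff = d1 - d2

# Center both series, then apply the Pearson formula directly.
x = total - total.mean()
y = diff - diff.mean()
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The hand-written value should agree with pandas up to floating-point error.
print(np.isclose(manual_r, total.corr(diff)))
```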


In [25]:
# convert the date column into a datetime object
df1['Date'] = pd.to_datetime(df1['Date'])

# extract the day, month, and year components
df1['day'] = df1['Date'].dt.day
df1['month'] = df1['Date'].dt.month
df1['year'] = df1['Date'].dt.year

# show the modified data frame
df1.head(5)

Unnamed: 0,Date,Direction 1,Direction 2,Total,day,month,year
0,2011-01-01,115.0,91.0,206.0,1,1,2011
1,2011-01-02,136.0,109.0,245.0,2,1,2011
2,2011-01-03,52.0,29.0,81.0,3,1,2011
3,2011-01-04,23.0,14.0,37.0,4,1,2011
4,2011-01-05,33.0,18.0,51.0,5,1,2011


Find the day with the highest total number of bikers in each month

In [28]:
# Group by month and find the day with the highest total bikers for each month
highest_bikers_per_month = df1.groupby(df1['Date'].dt.to_period("M"))['Total'].idxmax()

# Extract the corresponding rows
result_df = df1.loc[highest_bikers_per_month]

# Display the result
print(result_df[['month', 'day', 'Total']])

     month  day    Total
17       1   18   1913.0
37       2    7   2036.0
87       3   29   3650.0
99       4   10   4866.0
149      5   30   6624.0
178      6   28   7936.0
206      7   26  17526.0
213      8    2  19330.0
255      9   13  15597.0
276     10    4  13307.0
311     11    8  10026.0
335     12    2   7590.0
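
The left-most column in the output is the original row label, because `idxmax()` returns labels rather than values and `loc` then selects those rows. A minimal sketch of that two-step pattern on hypothetical data (the four rows are made up for illustration):

```python
import pandas as pd

# Hypothetical counts, for illustration only (not taken from the dataset).
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2011-01-03", "2011-01-18", "2011-02-07", "2011-02-20"]
    ),
    "Total": [10.0, 40.0, 30.0, 20.0],
})

# Step 1: per month, get the row label of the largest 'Total'.
peak_labels = demo.groupby(demo["Date"].dt.to_period("M"))["Total"].idxmax()

# Step 2: use those labels to pull out the full rows.
peaks = demo.loc[peak_labels]
print(peaks)
```

Swapping `idxmax()` for `idxmin()` gives the quietest day per month with the same pattern.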


#### 4.2 Find the day with the lowest total number of bikers in each month

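This code re-derives the 'month' and 'day' columns from the 'Date' column, groups the rows by calendar month, and uses `idxmin()` to get the row label of the lowest 'Total' in each group.
Those labels then select the matching rows, of which only 'month', 'day', and 'Total' are printed.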

In [30]:
# Assuming 'Date' is in datetime format
df1['month'] = df1['Date'].dt.month
df1['day'] = df1['Date'].dt.day

# Group by month and find the day with the lowest total bikers for each month
lowest_bikers_per_month = df1.groupby(df1['Date'].dt.to_period("M"))['Total'].idxmin()

# Extract the corresponding rows
result_df = df1.loc[lowest_bikers_per_month]

# Display the result
print(result_df[['month', 'day', 'Total']])

     month  day   Total
3        1    4    37.0
31       2    1   558.0
85       3   27   512.0
93       4    4  1685.0
146      5   27  1516.0
168      6   18   733.0
203      7   23  2256.0
218      8    7  3221.0
260      9   18  1711.0
280     10    8  2460.0
329     11   26  3322.0
358     12   25   830.0


#### 4.3 Calculate the total number of bikers over the first seven days of each month

In [31]:
# Filter the DataFrame to include only the first seven days of each month
first_seven_days = df1[df1['day'].between(1, 7)]

# Sum the 'Total' column over those days for each month
monthly_sum = first_seven_days.groupby('month')['Total'].sum()

# Display the result
print(monthly_sum)


month
1       963.0
2      8230.0
3     10097.0
4     25720.0
5     28146.0
6     26689.0
7     37231.0
8     88761.0
9     67593.0
10    74276.0
11    53056.0
12    35828.0
Name: Total, dtype: float64
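
The January figure can be cross-checked against the preview at the top of the notebook: the totals for 1 to 7 January add up to 963. A minimal sketch that rebuilds those eight preview rows and applies the same filter and grouping; the eighth row (8 January) is included to show that `between(1, 7)` leaves it out.

```python
import pandas as pd

# Dates and totals copied from the df1.head(8) preview.
jan = pd.DataFrame({
    "Date": pd.date_range("2011-01-01", periods=8, freq="D"),
    "Total": [206.0, 245.0, 81.0, 37.0, 51.0, 82.0, 261.0, 470.0],
})
jan["day"] = jan["Date"].dt.day
jan["month"] = jan["Date"].dt.month

# Keep days 1 through 7 (inclusive on both ends), then sum per month.
first_seven = jan[jan["day"].between(1, 7)]
jan_sum = first_seven.groupby("month")["Total"].sum()
print(jan_sum.loc[1])  # 963.0
```

Note that this selects the first seven calendar days of each month, not the first Monday-to-Sunday week.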
