# Data Exploration

This notebook describes the data exploration steps.

## Install dependencies

In [19]:
%pip install pandas
%pip install plotly

Note: you may need to restart the kernel to use updated packages.


## Load data

In [20]:
# import the modules
import pandas as pd 
import sqlite3

# connect to the database
con = sqlite3.connect("../data/data.sqlite")
 
# The following manipulations are done in SQL.
# All columns of the table 'pedestrians' are selected and some are renamed; from the column 'time of measurement', only the first 10 characters (the date) are kept.
# Additionally, a row-number column is added to the table.

# The temperature and rain data are each stored in two separate tables, so each pair is combined in its own subquery.
# Each subquery averages the values of its two tables and adds a row-number column.

# The three subqueries are joined on the row number, and the columns 'time', 'weekday', 'pedestrians', 'rain' and 'temperature' are selected.

df = pd.read_sql_query('''

WITH PedestrianData AS (
    SELECT
        SUBSTR(p.[time of measurement], 1, 10) AS time,
        p.weekday AS weekday,
        p.[pedestrians count] AS pedestrians,
        ROW_NUMBER() OVER (ORDER BY p.[time of measurement]) AS row_num
    FROM pedestrians p
),
RainData AS (
    SELECT
        (r1.[Niederschlag (6 bis 6 UTC)] + r2.[Niederschlag (6 bis 6 UTC)]) / 2 AS rain,
        ROW_NUMBER() OVER (ORDER BY r1.[Niederschlag (6 bis 6 UTC)]) AS row_num
    FROM rainmoe r1, rainnue r2
),
TemperatureData AS (
    SELECT
        (t1.[Mittelwert] + t2.[Mittelwert]) / 2 AS temperature,
        ROW_NUMBER() OVER (ORDER BY t1.[Mittelwert]) AS row_num
    FROM tempmoe t1, tempnue t2
)

SELECT
    pd.time,
    pd.weekday,
    pd.pedestrians,
    rd.rain,
    td.temperature
FROM
    PedestrianData pd
JOIN
    RainData rd ON pd.row_num = rd.row_num
JOIN
    TemperatureData td ON pd.row_num = td.row_num;

''', con)
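
A caveat on the query above: `FROM rainmoe r1, rainnue r2` has no join condition, so it pairs every row of one table with every row of the other, and the row numbers are ordered by the measured value rather than by date. Matching on `row_num` therefore does not by itself guarantee that a day's pedestrian count is paired with that same day's weather. If the weather tables carry a date column (an assumption; their schema is not shown in this notebook), joining on the date is a more direct approach. The sketch below uses an in-memory database with a hypothetical `date` column, a hypothetical `value` column, and made-up rain values purely for illustration.

```python
import sqlite3

# In-memory stand-in for ../data/data.sqlite.
# ASSUMPTION: the column names `date`, `pedestrians_count` and `value` are
# hypothetical, and the rain values are made-up toy numbers.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pedestrians (date TEXT, weekday TEXT, pedestrians_count INTEGER);
CREATE TABLE rainmoe (date TEXT, value REAL);
CREATE TABLE rainnue (date TEXT, value REAL);
INSERT INTO pedestrians VALUES
    ('2024-01-01', 'Monday', 9432),
    ('2024-01-02', 'Tuesday', 8959);
-- Rows are deliberately inserted out of date order.
INSERT INTO rainmoe VALUES ('2024-01-02', 8.0), ('2024-01-01', 1.0);
INSERT INTO rainnue VALUES ('2024-01-01', 2.0), ('2024-01-02', 6.0);
""")

# Join each station table on the date, then average the two stations per day.
query = """
SELECT
    p.date,
    p.weekday,
    p.pedestrians_count AS pedestrians,
    (r1.value + r2.value) / 2.0 AS rain
FROM pedestrians p
JOIN rainmoe r1 ON r1.date = p.date
JOIN rainnue r2 ON r2.date = p.date
ORDER BY p.date;
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

The same query string could be passed to `pd.read_sql_query(query, con)` to obtain a DataFrame, as in the cell above.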

### Look at the first rows

In [21]:
df.head(20)

|   | time | weekday | pedestrians | rain | temperature |
|---|------------|-----------|-------|------|-------|
| 0 | 2024-01-01 | Monday    | 9432  | 1.45 | 0.3   |
| 1 | 2024-01-02 | Tuesday   | 8959  | 7.2  | 0.9   |
| 2 | 2024-01-03 | Wednesday | 10900 | 5.8  | 1.9   |
| 3 | 2024-01-04 | Thursday  | 13322 | 0.45 | 0.9   |
| 4 | 2024-01-05 | Friday    | 16804 | 0.0  | -0.2  |
| 5 | 2024-01-06 | Saturday  | 8917  | 0.0  | -1.05 |
| 6 | 2024-01-07 | Sunday    | 6496  | 0.0  | -2.45 |
| 7 | 2024-01-08 | Monday    | 11646 | 0.0  | -4.5  |
| 8 | 2024-01-09 | Tuesday   | 12452 | 0.0  | -4.4  |
| 9 | 2024-01-10 | Wednesday | 11693 | 0.0  | -4.3  |


### Data exploration
Print some basic information about the data, such as the column types and non-null counts. Further exploration continues from here.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   time         134 non-null    object 
 1   weekday      134 non-null    object 
 2   pedestrians  134 non-null    int64  
 3   rain         134 non-null    float64
 4   temperature  134 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 5.4+ KB
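
One possible next exploration step is to compare pedestrian counts by weekday. The sketch below is a minimal illustration, not part of the original analysis: it loads only the first ten rows shown earlier in this notebook into an in-memory table (the table name `sample` is hypothetical) and computes the average count per weekday in SQL.

```python
import sqlite3

# The first ten (weekday, pedestrians) pairs from the df.head() output above.
sample_rows = [
    ("Monday", 9432), ("Tuesday", 8959), ("Wednesday", 10900),
    ("Thursday", 13322), ("Friday", 16804), ("Saturday", 8917),
    ("Sunday", 6496), ("Monday", 11646), ("Tuesday", 12452),
    ("Wednesday", 11693),
]

# ASSUMPTION: `sample` is a hypothetical scratch table used only for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample (weekday TEXT, pedestrians INTEGER)")
con.executemany("INSERT INTO sample VALUES (?, ?)", sample_rows)

# Average pedestrian count and number of observations per weekday,
# busiest weekday first.
weekday_stats = con.execute("""
    SELECT weekday, AVG(pedestrians) AS avg_pedestrians, COUNT(*) AS n_days
    FROM sample
    GROUP BY weekday
    ORDER BY avg_pedestrians DESC
""").fetchall()

for weekday, avg_pedestrians, n_days in weekday_stats:
    print(f"{weekday:<10} {avg_pedestrians:>9.1f}  (n={n_days})")
```

On the full DataFrame, the equivalent pandas expression would be `df.groupby("weekday")["pedestrians"].mean()`. With only ten rows, the averages here say little about real weekday patterns; the point is the shape of the aggregation.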
