# 1. Introduction (First page)

## 1.1 Context: Noise & Leuven 

context noise & Leuven (newspaper articles): 

- see newspaper articles for the relevance of noise topics and problems in Leuven (focus on night)
    - maybe add a few screenshots into the dashboard 

add a little text scetching the problem: 
- noise is a sensitive topic in many places, especially in cities!
- governments often have specific regulations to keep noise levels within certain limits
- The World Health Organization (WHO) recommends outdoor noise levels not exceeding 55 dB(A) during     daytime and 40 dB(A) during nighttime for residential areas.
- Leuven is giving special attention to noise problems (mostly related to student life) 
- They (City & KU Leuven) gathered noise data (for noise in The Naamsestraat in 2022), for the following locations: (see map)

In [72]:
import plotly.express as px
import pandas as pd

location_data = {
    'Naamsestraat 35': (50.877110, 4.700840),
    'Naamsestraat 57': (50.876490, 4.700700),
    'Naamsestraat 62': (50.875809, 4.700110),
    'Calvariekapel KU Leuven': (50.8745267, 4.6999168),
    'Parkstraat 2': (50.8741177, 4.7000138),
    'Kiosk Stadspark': (50.8752756, 4.7015081),
    'Naamsestraat 81': (50.8738250, 4.7001178),
    'Vrijthof': (50.8790375, 4.7011731),
    'His & Hears': (50.8752579, 4.7001115)
}

locations = pd.DataFrame(location_data.items(), columns=['Location', 'Coordinates'])
locations[['Latitude', 'Longitude']] = pd.DataFrame(locations['Coordinates'].tolist())

# Create a map + map layout 
fig = px.scatter_mapbox(locations, lat='Latitude', lon='Longitude',
                        size_max=40, color='Location',labels={'Location': 'Location'})

fig.update_layout(
    mapbox=dict(
        center=dict(lat=50.876490, lon=4.700700),
        zoom=14.7,
        style='carto-positron'
    ),
    showlegend=True
)

fig.show()



## 1.2 Research Objective

We could maybe add the two research objectives (exploring) and (predicting) on the first page into a text block as well. 

Research objectives: 
1. Exploring the data through data visualizations 
    - How is the noise distributed over time and location?
    - Do we find any patterns in and associations with noise events and facebook events? 
    -...
2. Prediction of median noise levels, specifically noise at night, using machine learning techniques 

## 1.3 Datafiles we work with 

Export 40: 
- contains percentiled noise data per hour during the months of february untill december 
- for X amount of locations in The Naamsestraat (see map)
- Noise values in this dataset are dB(A) measures, which reflects the impact of noise on human  perception and potential harm to human health. It takes into account the varying sensitivity of the human ear to different frequencies. The A-weighted decibel scale provides a more accurate representation of the perceived loudness of noise for human listeners.
- dB(A) is a logarithmic scale, where each increase of 10 dB represents a tenfold increase in sound intensity.
- some empty datasets 

Export 41:
- contains noise events registered during same time period for these locations with specific certainty level 
- some empty datasets 

Facebook events data:
- data gathered for the year 2022 on events in Leuven 
-...

Meteo data: 
- meteo data for 2022
-...

# 2. Noise Data (Second page)

### comments: 

- Visuals are either performed on 1 specific location 'Naamsestraat 35' or for all locations together (if main interest). it would be nice if we can have  a drop down menu for different locations + for some visuals a global image over all the locations might be more interesting 

## Preprocess Exp40

In [73]:
import pandas as pd
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import matplotlib.animation as animation

def gather_exp40(folder_path):
    exp40_data_path = [folder_path + '/csv_results_40_255439_mp-01-naamsestraat-35-maxim.csv',
                       folder_path + '/csv_results_40_255440_mp-02-naamsestraat-57-xior.csv',
                       folder_path + '/csv_results_40_255441_mp-03-naamsestraat-62-taste.csv',
                       folder_path + '/csv_results_40_255442_mp-05-calvariekapel-ku-leuven.csv',
                       folder_path + '/csv_results_40_255443_mp-06-parkstraat-2-la-filosovia.csv',
                       folder_path + '/csv_results_40_255444_mp-07-naamsestraat-81.csv',
                       folder_path + '/csv_results_40_255445_mp-08-kiosk-stadspark.csv',
                       folder_path + '/csv_results_40_280324_mp08bis---vrijthof.csv',
                       folder_path + '/csv_results_40_303910_mp-04-his-hears.csv']
    exp40_data = []
    
    for i in exp40_data_path:
        exp40_data.append(pd.read_csv(i, sep = ';'))
    return exp40_data

def divide_timestamp(df):
    df_final = df.copy()
    df_final['result_timestamp'] = df_final['result_timestamp'].str[:19]
    df_final['month'] = df_final['result_timestamp'].str[3:5].astype('int32')
    df_final['day'] = df_final['result_timestamp'].str[0:2].astype('int32')
    df_final['hour'] = df_final['result_timestamp'].str[11:13].astype('int32')
    df_final['day_month'] = df_final['day'].astype(str) + '/' + df_final['month'].astype(str)
    df_final['weekday'] = pd.to_datetime(df_final['result_timestamp']).dt.strftime("%A")
    return df_final

def drop_modify_exp40(df, first=True):
    final = []
    description_mapping = {
        'MP 01: Naamsestraat 35  Maxim': 'Naamsestraat 35',
        'MP 02: Naamsestraat 57 Xior': 'Naamsestraat 57',
        'MP 03: Naamsestraat 62 Taste': 'Naamsestraat 62',
        'MP 04: His & Hears': 'Naamsestraat 76',
        'MP 05: Calvariekapel KU Leuven': 'Calvariekapel KU Leuven',
        'MP 06: Parkstraat 2 La Filosovia': 'Parkstraat 2',
        'MP 07: Naamsestraat 81': 'Naamsestraat 81',
        'MP08bis - Vrijthof': 'Vrijthof'
       
    }
    
    for data in df:
        datadrop = data.drop(["laf005_per_hour_unit", "laf01_per_hour_unit", "laf05_per_hour_unit", "laf10_per_hour_unit",
                   "laf25_per_hour_unit", "laf50_per_hour_unit", "laf75_per_hour_unit", "laf90_per_hour_unit",
                   "laf95_per_hour_unit", "laf98_per_hour_unit", "laf99_per_hour_unit", "laf995_per_hour_unit"],
                  axis=1).copy()
        data_final = divide_timestamp(datadrop)
        data_final['description'] = data_final['description'].replace(description_mapping)
        final.append(data_final)
    return final



def initial_preprocessing_exp40(folder_path, first = True):
    exp40_data = gather_exp40(folder_path)
    exp40_final = drop_modify_exp40(exp40_data)
    return exp40_final

In [74]:
[df1_N, df2_N, df3_N, df4_N, df5_N, df6_N, df7_N, df8_N, df9_N] = initial_preprocessing_exp40("C:/Users/fieuw/Desktop/export_40", first=True)
df_N = pd.concat([df1_N, df2_N, df3_N, df4_N, df5_N, df6_N, df7_N, df8_N, df9_N], ignore_index=True)

## 2.1 Time series plots for percentiles 25, 50, & 75 

In [75]:
import pandas as pd
import plotly.graph_objs as go

# Filter the data for 'Naamsestraat 35' and create a new DataFrame
filtered_df = df_N[df_N['description'] == 'Naamsestraat 35'].copy()

# Convert 'result_timestamp' column to datetime data type
filtered_df['result_timestamp'] = pd.to_datetime(filtered_df['result_timestamp'])

# Set 'result_timestamp' as the index
filtered_df.set_index('result_timestamp', inplace=True)

# Sort the DataFrame by 'result_timestamp'
filtered_df.sort_index(inplace=True)

# Calculate hourly and daily averages for laf50, laf25, and laf75
hourly_laf50 = filtered_df['laf50_per_hour']
hourly_laf25 = filtered_df['laf25_per_hour']
hourly_laf75 = filtered_df['laf75_per_hour']
daily_avg = filtered_df.resample('D').mean()['laf50_per_hour']
monthly_avg = filtered_df.resample('M').mean()['laf50_per_hour']  # Calculate monthly average

# Identify the loudest and least loud months
loudest_month = monthly_avg.idxmax().strftime('%B')  # Get the month with the highest average
least_loud_month = monthly_avg.idxmin().strftime('%B')  # Get the month with the lowest average

# Create a trace for hourly laf50 values
trace_hourly_laf50 = go.Scatter(
    x=hourly_laf50.index,
    y=hourly_laf50,
    mode='lines',
    name='Hourly LAF50',
    line=dict(color='rgba(135, 206, 250, 1)')
)

# Create a trace for hourly average laf25 values
trace_hourly_laf25 = go.Scatter(
    x=hourly_laf25.index,
    y=hourly_laf25,
    mode='lines',
    name='Hourly LAF25',
    line=dict(color='rgba(0, 0, 255, 0.3)', width=1.5)
)

# Create a trace for hourly average laf75 values
trace_hourly_laf75 = go.Scatter(
    x=hourly_laf75.index,
    y=hourly_laf75,
    mode='lines',
    name='Hourly LAF75',
    line=dict(color='rgba(255, 165, 0, 0.3)', width=1.5)
)

# Create a trace for daily average values
trace_daily = go.Scatter(
    x=daily_avg.index,
    y=daily_avg,
    mode='markers',
    name='Daily Average',
    marker=dict(symbol='circle', size=4, color='blue')
)

# Create a trace for monthly average values
trace_monthly_avg = go.Scatter(
    x=monthly_avg.index,
    y=monthly_avg,
    mode='lines',
    name='Monthly Average',
    line=dict(color='rgba(0, 0, 255, 0.8)', width=2)  # Change the color and line width
)

# Add annotations for loudest and least loud months
annotations = [
    dict(
        x=monthly_avg.idxmax(),
        y=monthly_avg.max(),
        xref='x',
        yref='y',
        text=f"Loudest: {loudest_month}",
        showarrow=True,
        arrowhead=7,
        ax=0,
        ay=-40
    ),
    dict(
        x=monthly_avg.idxmin(),
        y=monthly_avg.min(),
        xref='x',
        yref='y',
        text=f"Least Loud: {least_loud_month}",
        showarrow=True,
        arrowhead=7,
        ax=0,
        ay=40
    )
]

# Create the plot layout with white background and annotations
layout = go.Layout(
    title='Time Series of Sound Levels - Naamsestraat 35',
    xaxis=dict(title='Time'),
    yaxis=dict(title='dB(A)'),
    showlegend=True,
    plot_bgcolor='white',
    annotations=annotations
)

# Create the figure and add the traces
fig = go.Figure(data=[trace_hourly_laf25, trace_hourly_laf75, trace_hourly_laf50, trace_daily, trace_monthly_avg], layout=layout)

# Display the plot
fig.show()


### findings 2.1

This plot shows us the hourly evolution of median noise values over time, as well as the 25th and 75th percentile for 1 location! (still would want to have the hour in the thumbnail). We can observe the general trend over time. There is some variation in daily noise. We see that based on the median noise values, March was the loudest month. 
However, we need to have  more detailed information, like which days of the week are the noisiest? which hours exactly? ? 

## 2.2 Heatmaps with weekdays and hours 

In [76]:
# Define the desired order of weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Calculate the average laf50_per_hour and laf01_per_hour values for each weekday and hour combination
heatmap_data_laf50 = df_N.groupby(['weekday', 'hour'])['laf50_per_hour'].mean().unstack()
heatmap_data_laf01 = df_N.groupby(['weekday', 'hour'])['laf01_per_hour'].mean().unstack()

# Reorder the rows based on the desired order of weekdays
heatmap_data_laf50 = heatmap_data_laf50.reindex(weekday_order)
heatmap_data_laf01 = heatmap_data_laf01.reindex(weekday_order)

# Create the heatmap figure for laf50_per_hour
fig_laf50 = go.Figure(data=go.Heatmap(
    z=heatmap_data_laf50.values,
    x=heatmap_data_laf50.columns,
    y=heatmap_data_laf50.index,
    colorscale='YlGnBu',
    colorbar=dict(title='Average LAF50')
))
fig_laf50.update_layout(
    title='Average LAF50 per Hour and Weekday over Locations',
    xaxis=dict(title='Hour'),
    yaxis=dict(title='Weekday')
)

### Findings 2.2

This plot plots the median noise values for different weekdays and hours (over all locations) in a heatmap. We observe for example that weekends on average are less noisy in the morning and that friday nights are louder than other nights. We will need some summary statistics to get a better view on which weekday is the noisiest. We might also be interested to take a closer look into daytime-nighttime distinction. We will define nighttime (22-6 (6 excluded)) and daytime (6-22 (22 excluded))

## 2.3 Summary & Spread Statistics 

In [77]:
import pandas as pd
from tabulate import tabulate

# Define the order of weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']


# Calculate the average noise values by weekday
average_noise = df_N.groupby('weekday')['laf50_per_hour'].mean().reset_index()

# Filter the dataset for the specific description and night hours
filtered_data_night = df_N[(df_N['description'] == 'Naamsestraat 35') & ((df_N['hour'] >= 22) | (df_N['hour'] < 6))]

# Calculate the average noise values by weekday
average_noise_night = filtered_data_night.groupby('weekday')['laf50_per_hour'].mean().reset_index()

# Filter the dataset for the specific description and day hours
filtered_data_day = df_N[(df_N['description'] == 'Naamsestraat 35') & ((df_N['hour'] >= 6) & (df_N['hour'] < 22))]

# Calculate the average noise values by weekday
average_noise_day = filtered_data_day.groupby('weekday')['laf50_per_hour'].mean().reset_index()

# Merge the average noise dataframes
merged_data = average_noise.merge(average_noise_night, on='weekday', suffixes=('_day', '_night'))
merged_data = merged_data.merge(average_noise_day, on='weekday')

# Rename the columns
merged_data.columns = ['Weekday', 'Average Noise (Day)', 'Average Noise (Night)', 'Average Noise (Daytime)']

# Set the Weekday column as a categorical variable with the specified order
merged_data['Weekday'] = pd.Categorical(merged_data['Weekday'], categories=weekday_order, ordered=True)

# Sort the dataframe by the weekday order
merged_data.sort_values('Weekday', inplace=True)

# Set the Weekday column as the index
merged_data.set_index('Weekday', inplace=True)

# Display the table
print(tabulate(merged_data, headers='keys', tablefmt='psql'))


+-----------+-----------------------+-------------------------+---------------------------+
| Weekday   |   Average Noise (Day) |   Average Noise (Night) |   Average Noise (Daytime) |
|-----------+-----------------------+-------------------------+---------------------------|
| Monday    |               49.1331 |                 45.5429 |                   52.2211 |
| Tuesday   |               49.5471 |                 47.2007 |                   52.5698 |
| Wednesday |               49.7446 |                 48.9245 |                   52.5098 |
| Thursday  |               50.6822 |                 50.6087 |                   53.3299 |
| Friday    |               50.4771 |                 51.0369 |                   53.009  |
| Saturday  |               48.7356 |                 48.975  |                   50.8258 |
| Sunday    |               47.9701 |                 47.9291 |                   50.153  |
+-----------+-----------------------+-------------------------+-----------------

This table can provide an overview of the noisiest day / night and daytime. Again it is based on the laf50_per_hour values, which are the median hourly noise values in db(A). It would be nice if we could highlight the highest noise value in each column, that's a quick way to visualize for each description, which weekday/night is the noisiest. 

findings: 
Though night has lower median noise values, they are still relatively high, especially for thurday and friday in this case. As we have seen in the beginning, the WHO's suggestion are < 40 db during night and < 55 during the day. It's especially the night that seems to be problematic if we evaluate this rule. Still remember, we look at averaged median noise values. Meaning on average in an hour during a friday night, only half of the time are db detected that are lower than or equal to 50! 

Now, it could be interesting to look into specific detected noise events that might be related to this problematic night noise. Let's look at that in next section 

# 3 Noise Events Data (third page)

## 3. 1 Preprocess Exp41 - Serkan's code with little adaptation 

In [78]:
def gather_exp41(folder_path):
    exp41_data_path = [folder_path + '/csv_results_41_255439_mp-01-naamsestraat-35-maxim.csv',
                       folder_path + '/csv_results_41_255440_mp-02-naamsestraat-57-xior.csv',
                       folder_path + '/csv_results_41_255441_mp-03-naamsestraat-62-taste.csv',
                       folder_path + '/csv_results_41_255442_mp-05-calvariekapel-ku-leuven.csv',
                       folder_path + '/csv_results_41_255443_mp-06-parkstraat-2-la-filosovia.csv',
                       folder_path + '/csv_results_41_255444_mp-07-naamsestraat-81.csv',
                       folder_path + '/csv_results_41_255445_mp-08-kiosk-stadspark.csv',
                       folder_path + '/csv_results_41_280324_mp08bis---vrijthof.csv',
                       folder_path + '/csv_results_41_303910_mp-04-his-hears.csv']
    exp41_data = []
    
    for i in exp41_data_path:
        exp41_data.append(pd.read_csv(i, sep = ';'))
    return exp41_data


def divide_timestamp(df):
    df_final = df.copy()
    df_final['result_timestamp'] = df.result_timestamp.str[:19]
    df_final['year'] = df.result_timestamp.str[6:10].astype('int32')
    df_final['month'] = df.result_timestamp.str[3:5].astype('int32')
    df_final['day'] = df.result_timestamp.str[0:2].astype('int32')
    df_final['hour'] = df.result_timestamp.str[11:13].astype('int32')
    df_final['minute'] = df.result_timestamp.str[14:16].astype('int32')
    df_final['second'] = df.result_timestamp.str[17:19].astype('int32')
    return df_final


def drop_modify_exp41(df, first=True):
    final = []
    description_mapping = {
        'MP 01: Naamsestraat 35  Maxim': 'Naamsestraat 35',
        'MP 02: Naamsestraat 57 Xior': 'Naamsestraat 57',
        'MP 03: Naamsestraat 62 Taste': 'Naamsestraat 62',
        'MP 05: Calvariekapel KU Leuven': 'Calvariekapel KU Leuven',
        'MP 06: Parkstraat 2 La Filosovia': 'Parkstraat 2',
        'MP 07: Naamsestraat 81': 'Naamsestraat 81',
        'MP08bis - Vrijthof': 'Vrijthof'
    }
    
    for data in df:
        data_nan = data.dropna(subset=['noise_event_laeq_primary_detected_certainty'])
        data_nan_drop = data_nan.drop(['noise_event_laeq_model_id_unit', 'noise_event_laeq_primary_detected_certainty_unit', 'noise_event_laeq_primary_detected_class_unit'], axis=1)
        data_nan_drop_uncertain75 = data_nan_drop[data_nan_drop['noise_event_laeq_primary_detected_certainty'] > 75]
        data_final = divide_timestamp(data_nan_drop_uncertain75)
        data_final['description'] = data_final['description'].replace(description_mapping)
        
        if first:
            le = LabelEncoder()
            data_final['noise_event_class'] = le.fit_transform(data_final['noise_event_laeq_primary_detected_class'])
            first = False
        else:
            data_final['noise_event_class'] = le.transform(data_final['noise_event_laeq_primary_detected_class'])
        
        final.append(data_final)
    
    return final

def initial_preprocessing_exp41(folder_path, first = True):
    exp41_data = gather_exp41(folder_path)
    exp41_final = drop_modify_exp41(exp41_data)
    return exp41_final

In [79]:
[df1_E, df2_E, df3_E, df4_E, df5_E, df6_E, df7_E, df8_E, df9_E] = initial_preprocessing_exp41("C:/Users/fieuw/Desktop/export_41", first=True)
df_E = pd.concat([df1_E,df2_E,df3_E,df4_E,df5_E,df6_E,df7_E,df8_E,df9_E], ignore_index=True)

## 3.2 stacked bar chart with count distribution 

In [80]:
import plotly.graph_objects as go

# Group the data by hour and noise_event_laeq_primary_detected_class
grouped_df = df_E.groupby(['hour', 'noise_event_laeq_primary_detected_class']).size().unstack()

# Create a list of colors for the different classes
colors = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)', 'rgb(148, 103, 189)', 'rgb(140, 86, 75)']

# Create a list to store the traces for each class
traces = []

# Iterate over each class
for i, column in enumerate(grouped_df.columns):
    # Create a bar trace for the current class
    trace = go.Bar(
        x=grouped_df.index,
        y=grouped_df[column],
        name=column,
        marker=dict(color=colors[i])
    )
    traces.append(trace)

# Create the stacked bar chart layout
layout = go.Layout(
    title="Counts Distribution of Noise Events by Hour",
    xaxis=dict(title="Hour"),
    yaxis=dict(title="Count"),
    barmode="stack"
)

# Create the figure with the traces and layout
fig = go.Figure(data=traces, layout=layout)

# Show the stacked bar chart
fig.show()


This stacked bar chart shows us simply the counts of detected noise events during the day. In general we see that the most detected noise event is the car, followed by Shouting. We see that shouting occurs most often between 11p.m. and 4 a.m. with a record for 1 a.m. 

## 3.3 Pie chart for nighttime / daytime distribution

In [81]:
import pandas as pd
import plotly.graph_objects as go

# Categorize hours as daytime or nighttime
def categorize_time(hour):
    if hour >= 22 or hour < 6:
        return 'Nighttime'
    else:
        return 'Daytime'

# Apply the categorization function to create 'Time_of_day' column
df_E['Time_of_day'] = df_E['hour'].apply(categorize_time)

# Calculate the sums for nighttime and daytime
nighttime_sums = df_E[df_E['Time_of_day'] == 'Nighttime'].groupby('noise_event_laeq_primary_detected_class')['noise_event_laeq_primary_detected_class'].count()
daytime_sums = df_E[df_E['Time_of_day'] == 'Daytime'].groupby('noise_event_laeq_primary_detected_class')['noise_event_laeq_primary_detected_class'].count()

# Create the data for the pie charts
nighttime_labels = nighttime_sums.index.tolist()
nighttime_sizes = nighttime_sums.values.tolist()
daytime_labels = daytime_sums.index.tolist()
daytime_sizes = daytime_sums.values.tolist()

# Define color palette for both charts
color_palette = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)', 'rgb(148, 103, 189)', 'rgb(140, 86, 75)']

# Create the first pie chart for nighttime
fig_nighttime = go.Figure(data=[go.Pie(
    labels=nighttime_labels,
    values=nighttime_sizes,
    marker=dict(colors=color_palette[:len(nighttime_labels)]),
    textinfo='label+percent',
    insidetextorientation='radial',
    textposition='outside'
)])

fig_nighttime.update_layout(
    title_text='Nighttime Distribution',
    showlegend=False
)

# Create the second pie chart for daytime
fig_daytime = go.Figure(data=[go.Pie(
    labels=daytime_labels,
    values=daytime_sizes,
    marker=dict(colors=color_palette[:len(daytime_labels)]),
    textinfo='label+percent',
    insidetextorientation='radial',
    textposition='outside'
)])

fig_daytime.update_layout(
    title_text='Daytime Distribution',
    showlegend=False
)

# Display the pie charts
fig_nighttime.show()
fig_daytime.show()
