# ANALYTICS

To create effective visualizations that clearly explain the data, we will employ various analytical techniques, including dimensionality reduction and data modeling, and experiment with different chart types to extract insights.

First, make sure the **accidents_preprocessed.csv** file is inside the **data** folder. This is the dataset we will work with; it contains information about motor vehicle crashes and collisions in New York City during the summer months of 2018. The file is already preprocessed, so no data issues should arise; the only remaining work is aggregating it for the different data representations. For details on how the dataset was preprocessed, see the preprocessing.ipynb file.

The objective of this part is to understand key questions about traffic accidents: which boroughs suffer the most accidents, the proportion between injuries and fatalities, which weather conditions are associated with a higher number of accidents, which vehicle types are most often involved in accidents, and so on. We also report new insights discovered while analyzing the data.

This part uses Python 3.12, although other versions may work as well. The charts are built with the altair library, and scikit-learn is used for dimensionality reduction, so make sure both are installed on your system before executing this code.

*Be aware that the lines displaying the plots in this file have been commented out to keep the file light, but they can easily be uncommented and executed to view the plots.

In [35]:
import pandas as pd
import numpy as np
import altair as alt

In [36]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [37]:
df = pd.read_csv('data/accidents_preprocessed.csv')

Dropdown and interactive selections for visualizations

In [38]:
# Interactive dropdown options
options_month = ['June', 'July', 'August', 'September']
input_dropdown_month = alt.binding_select(
    options=[None] + options_month, labels=['All'] + options_month, name='Month: '
)
selection_month = alt.param(name='SelectMonth', value=None, bind=input_dropdown_month)

In [39]:
# In Altair 5 the `empty` argument is a boolean; empty=True selects everything when nothing is clicked
selection_vehicle = alt.selection_point(fields=['VEHICLE TYPE'], name="SelectVehicle", empty=True)
selection_borough = alt.selection_point(fields=['BOROUGH'], name="SelectBorough", empty=True)
selection_weather = alt.selection_point(fields=['WEATHER'], name="SelectWeather", empty=True)
selection_point = alt.selection_point(fields=['HOUR'], name='SelectPoint', empty=True)
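
The month dropdown is applied in the charts of this notebook through the filter `(alt.datum.MONTH == selection_month) | (selection_month == None)`. In plain pandas terms, the intended behaviour is roughly the following sketch. The helper `filter_by_month` and the toy data are hypothetical and only illustrate the logic: no selection keeps every row, a selected month keeps only that month.

```python
import pandas as pd


def filter_by_month(frame, month=None):
    """Return every row when month is None, otherwise only the rows of that month."""
    if month is None:
        return frame
    return frame[frame["MONTH"] == month]


# Toy data standing in for the accidents table (values are made up)
toy = pd.DataFrame({
    "MONTH": ["June", "July", "July", "August"],
    "HOUR": [8, 16, 17, 3],
})

all_rows = filter_by_month(toy)            # dropdown set to "All"
july_rows = filter_by_month(toy, "July")   # dropdown set to "July"
print(len(all_rows), len(july_rows))
```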



## DIMENSIONALITY REDUCTION

First, relations among variables and observations will be analyzed with dimensionality reduction techniques.

In [40]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.decomposition import PCA

In [41]:
selected_columns = ['BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 
                    'CONTRIBUTING FACTOR', 'VEHICLE TYPE', 'MONTH', 
                    'HOUR', 'WEEK_DAY', 'DAY', 'WEATHER', 
                    'TOTAL_INJURIES', 'TOTAL_DEATHS']

categorical_cols = ['BOROUGH', 'CONTRIBUTING FACTOR', 'VEHICLE TYPE', 'MONTH', 'WEEK_DAY', 'WEATHER']

df_pca = df[selected_columns].dropna().copy()

df_pca['OriginalIndex'] = df_pca.index

for c in categorical_cols:
    df_pca[c] = df_pca[c].astype(str)

In [42]:
# Ordinal encoding of categorical variables
encoder = OrdinalEncoder()
df_pca[categorical_cols] = encoder.fit_transform(df_pca[categorical_cols])

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_pca.drop(columns=['OriginalIndex']))

# Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

In [43]:
# Explained variance ratio of each component
explained_var = pca.explained_variance_ratio_

# Scree data
scree_data = pd.DataFrame({
    'Component': np.arange(1, len(explained_var)+1),
    'Explained_Variance': explained_var
})

scree_plot = alt.Chart(scree_data).mark_line(point=True).encode(
    x=alt.X('Component:O', title='Component'),
    y=alt.Y('Explained_Variance:Q', title='Explained Variance'),
    tooltip=['Component', 'Explained_Variance']
).properties(
    title='Scree Plot',
    width=400,
    height=300
)

scree_plot

From the scree plot we can see that the principal components are not very informative individually, as each explains only a limited share of the variance, and that the most important components are the first two.
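
A scree plot is often read together with the cumulative explained variance, which tells how many components are needed to reach a given share of the total. The sketch below uses synthetic data, and the 80% threshold is an arbitrary assumption, not a value taken from this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 features, two of them strongly correlated
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

pca_demo = PCA().fit(StandardScaler().fit_transform(X_demo))

# Running total of the explained variance ratio
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 80%
n_components_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(cumulative.round(3), n_components_80)
```

With the notebook's own objects, the same two lines could be applied to `pca.explained_variance_ratio_`.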

In [44]:
df_pca['PC1'] = X_pca[:, 0]
df_pca['PC2'] = X_pca[:, 1]

# Apply KMeans
# n_clusters=4 was chosen after analyzing the elbow method and the relation to the categories
kmeans = KMeans(n_clusters=4, random_state=42)
df_pca['cluster'] = kmeans.fit_predict(X_pca[:, :2])
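
The elbow analysis mentioned in the comment above is not shown in this notebook. A minimal sketch of how such a check could look is given here, on synthetic points with four well-separated groups; the data and the range of `k` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D points around four centers (stand-in for the PC1/PC2 coordinates)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia (within-cluster sum of squares) for each candidate number of clusters
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias[k] = model.inertia_

# The "elbow" is the k after which the inertia stops dropping sharply
for k, value in inertias.items():
    print(k, round(value, 1))
```

In the notebook the same loop would be run on `X_pca[:, :2]` instead of `X_demo`.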

In [45]:
# Merge the PCA coordinates and cluster labels back into the original data.
# df_pca keeps the original row index, so an index-aligned left join is enough;
# rows removed by dropna() get NaN in the new columns.
df = df.join(df_pca[['PC1', 'PC2', 'cluster']])

In [46]:
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
feature_names = df_pca.drop(columns=['OriginalIndex','PC1','PC2','cluster']).columns
loading_df = pd.DataFrame(
    loadings,
    index=feature_names,
    columns=[f'PC{i}' for i in range(1, len(pca.components_)+1)]
)

loading_2d = loading_df[['PC1', 'PC2']].reset_index()
loading_2d.columns = ['Variable', 'PC1', 'PC2']
arrow_scale = 2.0
loading_2d['x_end'] = loading_2d['PC1'] * arrow_scale
loading_2d['y_end'] = loading_2d['PC2'] * arrow_scale

print("Loading factors:")
print(loading_df)

Loading factors:
                          PC1       PC2       PC3       PC4       PC5  \
BOROUGH             -0.648679 -0.377191  0.148980 -0.026039  0.141196   
ZIP CODE             0.136186 -0.277569  0.100594  0.223353 -0.174573   
LATITUDE             0.839607  0.002042  0.042861  0.028535  0.060385   
LONGITUDE            0.550999 -0.558048  0.194689 -0.001324  0.113764   
CONTRIBUTING FACTOR  0.240717  0.287894 -0.151088 -0.412035 -0.225774   
VEHICLE TYPE         0.045652  0.566240 -0.131415  0.392310  0.265113   
MONTH               -0.007865 -0.048875  0.264923 -0.309406  0.548209   
HOUR                -0.060824 -0.181642  0.042339  0.272427 -0.436564   
WEEK_DAY            -0.013085  0.164494  0.608979  0.114960 -0.136521   
DAY                 -0.011523  0.177839  0.152489  0.187277 -0.312603   
WEATHER              0.001402  0.284646  0.699720  0.005107  0.043518   
TOTAL_INJURIES       0.037795 -0.125840 -0.067181  0.634383  0.175201   
TOTAL_DEATHS         0.021741  0.0

We can see that principal component 1 is highly influenced by borough, latitude and longitude. The second principal component is influenced by vehicle type, longitude, borough, contributing factor, weather and zip code. Finally, principal component 3 relies much less on location and focuses on weather and the day of the week.
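
Reading a loading table by eye is error-prone, so it can help to rank the variables of each component by absolute loading. The sketch below rebuilds a small loading table on synthetic data (the columns `a` to `d` are hypothetical) using the same formula as above, components scaled by the square root of the explained variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: "a" and "b" are strongly correlated, "c" and "d" are noise
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
demo["b"] = demo["a"] + 0.2 * rng.normal(size=300)

pca_demo = PCA().fit(StandardScaler().fit_transform(demo))

# Loadings: component vectors scaled by the square root of their variance
loadings_demo = pd.DataFrame(
    pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_),
    index=demo.columns,
    columns=[f"PC{i}" for i in range(1, pca_demo.n_components_ + 1)],
)

# Variables with the largest absolute loading on PC1
top_pc1 = loadings_demo["PC1"].abs().sort_values(ascending=False).head(2)
print(top_pc1)
```

Applied to the notebook's `loading_df`, the last statement would give the ranking for any chosen component.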

In [47]:
# Build the line segments used later to draw the loading factors on the PCA scatter plot
line_data = []
for i, row in loading_2d.iterrows():
    line_data.append({
        'Variable': row['Variable'], 'x': 0, 'y': 0
    })
    line_data.append({
        'Variable': row['Variable'], 'x': row['x_end'], 'y': row['y_end']
    })
line_df = pd.DataFrame(line_data)

In [48]:
points = alt.Chart(df).mark_circle(size=40).encode(
    x=alt.X('PC1:Q', title='PC1'),
    y=alt.Y('PC2:Q', title='PC2'),
    color=alt.condition(selection_vehicle & selection_borough & selection_weather & selection_point, 
                        'cluster:N', 
                        alt.value('lightgray')),
).add_params(
    selection_weather, selection_month, selection_point, selection_vehicle, selection_borough
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
    
).properties(
    width=600,
    height=400,
    title='Biplot PCA with Clusters'
)

# Loading factors
lines = alt.Chart(line_df).mark_line(color='black').encode(
    x='x:Q',
    y='y:Q',
    detail='Variable:N'
)

arrow_heads = alt.Chart(loading_2d).mark_point(color='black').encode(
    x='x_end:Q',
    y='y_end:Q',
    tooltip=['Variable']
)

text = alt.Chart(loading_2d).mark_text(
    align='left',
    dx=5,
    dy=-5,
    color='black'
).encode(
    x='x_end:Q',
    y='y_end:Q',
    text='Variable'
)

biplot = (points + lines + arrow_heads + text)
#biplot

In [49]:
# Visualize interaction of PCA scatterplot with different variables
def create_biplot(color_column, color_title):
    points = alt.Chart(df_pca).mark_circle(size=40).encode(
        x=alt.X('PC1:Q', title='PC1'),
        y=alt.Y('PC2:Q', title='PC2'),
        color=alt.Color(color_column, title=color_title)
    ).properties(
        width=400,
        height=300
    ).interactive()
    
    return points + lines + arrow_heads + text

biplot_borough = create_biplot('BOROUGH:N', 'Borough')
biplot_vehicle = create_biplot('VEHICLE TYPE:N', 'Vehicle Type')
biplot_weather = create_biplot('WEATHER:N', 'Weather')
biplot_factor  = create_biplot('CONTRIBUTING FACTOR:N', 'Contributing Factor')

final_chart = alt.vconcat(
    alt.hconcat(biplot_borough, biplot_vehicle),
    alt.hconcat(biplot_weather, biplot_factor)
).properties(title='PCA biplots colored by different columns')

#final_chart


We can see that Borough is the variable that best separates the observations into 4 groups: although there are 5 boroughs, boroughs 1 and 2 (orange and red) are mixed, with their observations falling within the same group. For vehicle type, weather and contributing factor, on the other hand, there are no clear differences among observations.
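
The visual impression that boroughs map onto clusters can be checked with a contingency table. The sketch below uses made-up labels, so the counts carry no meaning; it only shows the mechanics with `pd.crosstab`.

```python
import pandas as pd

# Toy assignment of observations to clusters (values are made up)
toy = pd.DataFrame({
    "BOROUGH": ["BRONX", "BRONX", "BROOKLYN", "BROOKLYN", "BROOKLYN", "QUEENS", "QUEENS"],
    "cluster": [0, 0, 0, 0, 1, 2, 2],
})

# Rows: boroughs, columns: clusters, cells: number of observations
table = pd.crosstab(toy["BOROUGH"], toy["cluster"])

# Cluster that holds most observations of each borough
dominant = table.idxmax(axis=1)
print(table)
print(dominant)
```

After the merge above, `pd.crosstab(df['BOROUGH'], df['cluster'])` would be the equivalent call on the real data.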

In [16]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

selected_columns = [
    'BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 
    'CONTRIBUTING FACTOR', 'VEHICLE TYPE', 'MONTH', 'HOUR', 
    'WEEK_DAY', 'DAY', 'WEATHER', 'TOTAL_INJURIES', 'TOTAL_DEATHS'
]

df_selected = df[selected_columns].dropna().copy()

categorical_cols = [
    'BOROUGH', 'CONTRIBUTING FACTOR', 'VEHICLE TYPE', 
    'MONTH', 'WEEK_DAY', 'WEATHER'
]

for c in categorical_cols:
    df_selected[c] = df_selected[c].astype(str)

encoder = OrdinalEncoder()
df_selected[categorical_cols] = encoder.fit_transform(df_selected[categorical_cols])


scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_selected)

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

In [50]:
# Visualize t-sne results
df_vis = pd.DataFrame(X_tsne, columns=['TSNE1', 'TSNE2'])

df_vis = df_vis.join(df_selected.reset_index(drop=True))

df_vis['index'] = df_vis.index.astype(str)

tooltip_cols = ['index', 'BOROUGH', 'VEHICLE TYPE', 'WEATHER', 'TOTAL_INJURIES']

chart = alt.Chart(df_vis).mark_circle(size=60).encode(
    x=alt.X('TSNE1:Q', title='t-SNE Dim 1'),
    y=alt.Y('TSNE2:Q', title='t-SNE Dim 2'),
    color=alt.Color('BOROUGH:N', title='Borough'),
    tooltip=tooltip_cols
).properties(
    width=600,
    height=400,
    title='t-SNE visualization'
).interactive()

#chart

The t-SNE results were not conclusive, as no relation with the original variables could be extracted. Hence, it is PCA that will be included in the final dashboard. (*The representation analogous to the one done for PCA, with colors linked to the categories of different variables, was also produced, yet it is not shown in this notebook because the computations and results were quite heavy and little insight was extracted. However, we encourage the reader to run the analysis if interested.)
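
For readers who want to repeat the t-SNE analysis at a lower cost, one common option is to embed a random subsample instead of the full dataset. This is a sketch under assumptions: the data is synthetic and the sample size of 150 is arbitrary; it is not the procedure used for the results above.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the scaled feature matrix
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))

# Draw a random subsample without replacement to reduce the computation
sample_size = 150
idx = rng.choice(len(X_demo), size=sample_size, replace=False)
X_sample = X_demo[idx]

# perplexity must stay below the number of sampled rows
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_sample)
print(embedding.shape)
```

Keeping `idx` makes it possible to look up the original rows of the sampled points when coloring the embedding.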

## ACCIDENTS MAP

To effectively analyze the locations of accidents, we utilize a geospatial dot map that encodes the severity of each incident. Choropleth maps were excluded from our analysis due to their reliance on data aggregation, which can conceal critical information and reduce the granularity of the insights.

In [51]:
df['LATITUDE'] = pd.to_numeric(df['LATITUDE'], errors='coerce')
df['LONGITUDE'] = pd.to_numeric(df['LONGITUDE'], errors='coerce')

df = df.dropna(subset=['LATITUDE', 'LONGITUDE'])

In [52]:
# GeoJSON url downloaded from GitHub
raw_geojson_url = 'https://raw.githubusercontent.com/AlbaGrc/Traffic_Accidents_NYC/main/NYC_map.geojson'

ny_city_map = alt.Data(
    url=raw_geojson_url,
    format=alt.DataFormat(property='features')
)

In [53]:
# Create the base map
nyc_base_map = alt.Chart(ny_city_map).mark_geoshape(
    fill='lightgray', stroke='white', strokeWidth=1.3, opacity=0.4
).encode(tooltip=alt.value(None))


# Create the accident points layer
points = alt.Chart(df).mark_circle(size=30, color='red', opacity=0.6).encode(
    longitude='LONGITUDE:Q',
    latitude='LATITUDE:Q',
    tooltip=['DATETIME:N', 'BOROUGH:N', 'ZIP CODE:N', 'NUMBER OF PERSONS INJURED:Q'],
).add_params(
    selection_month
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
)

# Enable zoom and pan
final_map = (nyc_base_map + points).properties(
    width=800,
    height=600
).interactive()

# Show the interactive map
#final_map

This first approach for the map is not bad, but the points are too big and color encodes no information. We are going to encode accident severity through color and add a histogram with the accident count per borough.
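
The `SEVERITY` column used in the next cells comes from the preprocessed file, and this notebook does not show how it was built. As an illustration only, the sketch below assumes a rule where any death outranks any injury; the actual rule in preprocessing.ipynb may differ.

```python
import numpy as np
import pandas as pd

# Toy rows (values are made up)
toy = pd.DataFrame({
    "TOTAL_INJURIES": [0, 2, 1, 0],
    "TOTAL_DEATHS": [0, 0, 1, 1],
})

# Assumed rule: conditions are checked in order, so deaths take priority over injuries
conditions = [toy["TOTAL_DEATHS"] > 0, toy["TOTAL_INJURIES"] > 0]
toy["SEVERITY"] = np.select(conditions, ["Death", "Injury"], default="No Damage")
print(toy)
```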

In [54]:
nyc_base_map = alt.Chart(ny_city_map).mark_geoshape(
    fill='lightgray', stroke='white', strokeWidth=1.3, opacity=0.4
).encode(tooltip=alt.value(None))

# Use severity color to differentiate points in three layers, one for each severity
death_points = alt.Chart(df[df['SEVERITY'] == 'Death']).mark_circle(size=10, opacity=0.8).encode(
    longitude='LONGITUDE:Q',
    latitude='LATITUDE:Q',
    color=alt.Color(
        'SEVERITY:N',
        scale=alt.Scale(
            domain=['Death', 'Injury', 'No Damage'],
            range=['red', 'orange', '#FFEB3B']
        ),
        legend=alt.Legend(title='Severity')
    ),
    tooltip=['DATETIME:N','BOROUGH:N','ZIP CODE:N','TOTAL_DEATHS:Q','TOTAL_INJURIES:Q'],
    opacity=alt.condition(
        selection_borough & selection_vehicle & selection_weather & selection_point,
        alt.value(1), alt.value(0.1)
    )
).add_params(
    selection_month
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
)

injury_points = alt.Chart(df[df['SEVERITY'] == 'Injury']).mark_circle(size=8, opacity=0.6).encode(
    longitude='LONGITUDE:Q',
    latitude='LATITUDE:Q',
    color=alt.Color(
        'SEVERITY:N',
        scale=alt.Scale(
            domain=['Death', 'Injury', 'No Damage'],
            range=['red', 'orange', '#FFEB3B']
        ),
        legend=alt.Legend(title='Severity')
    ),
    tooltip=['DATETIME:N','BOROUGH:N','ZIP CODE:N','TOTAL_DEATHS:Q','TOTAL_INJURIES:Q'],
    opacity=alt.condition(
        selection_borough & selection_vehicle & selection_weather & selection_point,
        alt.value(1), alt.value(0.1)
    )
).add_params(
    selection_month
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
)

no_damage_points = alt.Chart(df[df['SEVERITY'] == 'No Damage']).mark_circle(size=6, opacity=0.6).encode(
    longitude='LONGITUDE:Q',
    latitude='LATITUDE:Q',
    color=alt.Color(
        'SEVERITY:N',
        scale=alt.Scale(
            domain=['Death', 'Injury', 'No Damage'],
            range=['red', 'orange', '#FFEB3B']
        ),
        legend=alt.Legend(title='Severity')
    ),
    tooltip=['DATETIME:N','BOROUGH:N','ZIP CODE:N','TOTAL_DEATHS:Q','TOTAL_INJURIES:Q'],
    opacity=alt.condition(
        selection_borough & selection_vehicle & selection_weather & selection_point,
        alt.value(1), alt.value(0.1)
    )
).add_params(
    selection_month
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
)

# Final map: layer everything together
final_map = (
    nyc_base_map + no_damage_points + injury_points + death_points
).properties(
    width=400,
    height=400
).interactive().add_params(
    selection_month,
    selection_borough,
    selection_vehicle,
    selection_weather,
    selection_point
)

In [55]:
# Create the horizontal histogram
bar_chart = alt.Chart(df).mark_bar(size=35).encode(
    x=alt.X('count():Q', title='Number of Accidents'),
    y=alt.Y('BOROUGH:N', sort='-x', title='Borough'),
    color=alt.condition(
        selection_borough,
        alt.value('steelblue'),
        alt.value('lightgray')
    ),
    opacity=alt.condition(
        selection_borough,
        alt.value(1),
        alt.value(0.3)
    ),
    tooltip=['BOROUGH:N', 'count():Q']
).add_params(
    selection_month,
    selection_borough
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
).properties(
    width=300,
    height=300
)

In [56]:
# Combine map and histogram
combined_chart = alt.hconcat(
    final_map,
    bar_chart
).resolve_scale(
    color='independent'
).resolve_legend(
    color='independent'
)

#combined_chart

Now we can properly see the distribution of accidents across neighbourhoods and their severity.

## HOURS LINE CHART

To analyze how the accident rate changes throughout the day, we are going to plot a line chart of the hour of the day against the accident count.

In [57]:
# Line chart with hours on axis X and accidents count on axis Y
hours_line_chart = alt.Chart(df).mark_line(point=True).encode(
    x=alt.X('HOUR:O',
            title='Hour of the day',
            axis=alt.Axis(labelAngle=0)),
    y=alt.Y('count():Q',
            title='Accidents count'),
    color=alt.condition(selection_point, alt.value('steelblue'), alt.value('lightgray')),
    tooltip=['HOUR:O', 'count():Q']
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)  
).add_params(
    selection_month, selection_point 
).properties(
    width=600,
    height=400
)

#hours_line_chart

We can clearly see the peak hour (16h) and the hour with the lowest collision count (3h). Moreover, the evolution over the day is clearly shown.
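
The peak and the quietest hour can also be read off numerically rather than from the chart. A minimal sketch on made-up data follows; with the real data the `HOUR` column of `df` would be used instead of `toy`.

```python
import pandas as pd

# Toy accident records, one row per accident (values are made up)
toy = pd.DataFrame({"HOUR": [16, 16, 16, 8, 8, 3, 17, 17]})

# Accidents per hour, ordered by hour of the day
counts = toy["HOUR"].value_counts().sort_index()

peak_hour = counts.idxmax()      # hour with the most accidents
quietest_hour = counts.idxmin()  # hour with the fewest (among hours that appear)
print(peak_hour, quietest_hour)
```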

Now we are going to try to combine the days of the week with the total amount of accidents each day and see distributions using a scatterplot and a boxplot

In [58]:
week_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['WEEK_DAY'] = pd.Categorical(df['WEEK_DAY'], categories=week_order, ordered=True)

# Boxplot
boxplot = alt.Chart(df).mark_boxplot(extent='min-max').encode(
    x=alt.X('WEEK_DAY:N', title='Day of the week'),
    y=alt.Y('HOUR:Q', title='Hour of the day', scale=alt.Scale(domain=[0, 24]))
)

# Add individual observations
points = alt.Chart(df).mark_circle(size=10, color='black', opacity=0.3).encode(
    x=alt.X('WEEK_DAY:N'),
    y=alt.Y('HOUR:Q'),
    tooltip=['BOROUGH', 'HOUR', 'WEEK_DAY', 'CONTRIBUTING FACTOR']  # Details on hover
)

# Combine boxplots and points
chart_week_distribution1 = (boxplot + points).properties(
    title='Accidents distribution throughout the week',
    width=700,
    height=400
)

#chart_week_distribution1

This chart does not appear to be very informative: the distribution varies very little among days, so we gain no insight into how accidents are distributed over the week. Therefore, this chart will be discarded from the final visualization, as opposed to the first one, which proved very informative.

We are going to try plotting distributions for each day in order to find more information

In [59]:
base = alt.Chart(df).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None) 
).transform_aggregate(
    count='count()', 
    groupby=['WEEK_DAY', 'HOUR']  
)


line_chart = base.mark_area(opacity=0.7, interpolate='monotone').encode(
    x=alt.X('HOUR:Q', title='Hour of the day', scale=alt.Scale(domain=[0, 23])),
    y=alt.Y('count:Q', title='Accidents'),
    color=alt.Color('WEEK_DAY:N', legend=None),
    row=alt.Row('WEEK_DAY:N', title='Day of the week')
)

chart = line_chart.add_params(
    selection_month 
).properties(
    title="Accidents distribution throughout the week",
    width=500,
    height=50 
)

#chart

Now we can better see the distribution of accidents for each day of the week. However, the plot is too big to include in the final visualization given how little insight we get from it.

## HEATMAP CHART

We are now going to inspect the distribution among months and days of the month.

In [60]:
df_filtered = df[df['MONTH'].isin(['June', 'July', 'August', 'September'])]

month_histogram = alt.Chart(df_filtered).mark_bar().encode(
    x=alt.X('MONTH:N', title='Month', sort=['June', 'July', 'August', 'September']),
    y=alt.Y('count()', title='Total crashes'),
    tooltip=['count()']
).properties(
    title='Total crashes per month',
    width=600,
    height=400
)

#month_histogram

We see no big differences among months, so we will try another approach to get more information. We will try a heatmap so we can analyze the distribution of accidents throughout the months.

In [61]:
df['DAY'] = pd.to_numeric(df['DAY'], errors='coerce')
df['MONTH'] = pd.Categorical(df['MONTH'], categories=['June', 'July', 'August', 'September'], ordered=True)

# Select a specific day and month
selection_day_month = alt.selection_point(fields=['DAY', 'MONTH'], name='SelectDayMonth')

base = alt.Chart(df).encode(
    x=alt.X('DAY:O', title='Day', scale=alt.Scale(domain=np.arange(1, 32))),
    y=alt.Y('MONTH:N', title='Month', scale=alt.Scale(domain=options_month)),
    tooltip=['DAY:O', 'MONTH:N', 'count():Q']  # Tooltip
).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)  # Filter month
).add_params(
    selection_month
)

heatmap = base.mark_rect().encode(
    color=alt.Color('count():Q', scale=alt.Scale(scheme='orangered'), title='Accidents Number', legend=None)
)

# Labels with count for each day
labels = base.mark_text(baseline='middle', fontSize=9).encode(
    text=alt.Text('count():Q'),
    color=alt.value('black')
)

heatmap_chart = (heatmap + labels).properties(
    width=700,  
    height=140   
)

#heatmap_chart

We can clearly see the distribution of accidents throughout each month and between months. We can also focus on a single month by selecting it in the dropdown, and analyze which day has the largest number of accidents in each month or whether accidents concentrate in some span of time.
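
The numbers behind a month-by-day heatmap can be reproduced as a small pivot table, which is handy for locating the busiest day exactly. The sketch below uses made-up rows; only the column names `MONTH` and `DAY` match the dataset.

```python
import pandas as pd

# Toy accident records, one row per accident (values are made up)
toy = pd.DataFrame({
    "MONTH": ["June", "June", "June", "July", "July"],
    "DAY": [1, 1, 2, 1, 3],
})

# Accidents per (month, day) pair
day_counts = toy.groupby(["MONTH", "DAY"]).size()

# Same counts as a month x day grid, with months in calendar order
grid = day_counts.unstack(fill_value=0).reindex(["June", "July"])

# (month, day) pair with the most accidents
busiest = day_counts.idxmax()
print(grid)
print(busiest)
```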

## WEATHER CHART

A straightforward way to analyze weather conditions is through a histogram.

In [62]:
base = alt.Chart(df).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
).transform_aggregate(
    count='count()',  
    groupby=['WEATHER']  
).transform_window(  
    rank='rank(count)',
    sort=[alt.SortField('count', order='descending')]
).encode(
    x=alt.X('WEATHER:N', title='Weather condition', sort='-y', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('count:Q', title='Total number of accidents'),
    tooltip=['WEATHER:N', 'count:Q']  
)

bars = base.mark_bar().encode(
    color=alt.condition(selection_weather, alt.value('steelblue'), alt.value('lightgray')),
    opacity=alt.condition(selection_weather, alt.value(1), alt.value(0.6)) 
)

wea_chart = bars.add_params(
    selection_month, selection_weather
).properties(
    title="Accident count by weather condition",
    width=300,
    height=300
)

#wea_chart

The insight is interesting, but color could be used to encode more information. We are going to encode the severity of accidents through color and add emojis for an easier visual interpretation.
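
Because the weather classes differ a lot in size, raw counts can hide whether some condition has a higher share of severe accidents. A possible complement, sketched here on made-up rows, is the proportion of each severity level within each weather condition.

```python
import pandas as pd

# Toy accident records (values are made up)
toy = pd.DataFrame({
    "WEATHER": ["rain", "rain", "clear-day", "clear-day", "clear-day", "cloudy"],
    "SEVERITY": ["Injury", "No Damage", "No Damage", "No Damage", "Injury", "Death"],
})

# Share of each severity level within every weather condition
shares = toy.groupby("WEATHER")["SEVERITY"].value_counts(normalize=True)
print(shares)
```

The result is indexed by (weather, severity) pairs, so `shares.loc[("rain", "Injury")]` returns the share of injury accidents under rain in the toy data.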

In [63]:
weather_icon_to_emoji = {
    'rain': '🌧️',
    'clear-day': '☀️',
    'partly-cloudy-day': '⛅',
    'cloudy': '☁️'
}

df['WEATHER_EMOJI'] = df['WEATHER'].map(weather_icon_to_emoji)

base = alt.Chart(df).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)
).transform_aggregate(
    count='count()',
    groupby=[
        'MONTH', 'WEEK_DAY', 'BOROUGH', 'VEHICLE TYPE', 'HOUR',
        'WEATHER', 'DAY', 'SEVERITY', 'WEATHER_EMOJI'
    ]
).encode(
    x=alt.X('WEATHER:N', title='Weather condition', sort='-y', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('sum(count):Q', title='Accidents count'),
    color=alt.Color(
        'SEVERITY:N',
        scale=alt.Scale(
            domain=['No Damage', 'Injury', 'Death'],
            range=['#FFEB3B', 'orange', 'red']
        ),
        legend=None
    ),
    tooltip=['WEATHER:N', 'SEVERITY:N', 'sum(count):Q']
)

bars = base.mark_bar().encode(
    opacity=alt.condition(selection_weather, alt.value(1), alt.value(0.6))
)

emojis = base.mark_text(
    align='center', baseline='bottom', dy=-40, size=20
).transform_filter(
    alt.datum.SEVERITY == 'No Damage'  # Only one emoji per bar
).encode(
    text='WEATHER_EMOJI:N'
)

weather_chart = alt.layer(
    bars, emojis
).add_params(
    selection_month, selection_weather
).properties(
    width=300,
    height=400
)

# Show the chart
#weather_chart

## VEHICLE TYPE

We first try a histogram to encode the number of accidents per vehicle type.

In [64]:
# Interactive selections for VEHICLE TYPE and WEATHER (these redefine the ones created earlier)
selection_vehicle = alt.selection_point(fields=['VEHICLE TYPE'], name='SelectVehicle')
selection_weather = alt.selection_point(fields=['WEATHER'], name='SelectWeather')

# Histogram base
base = alt.Chart(df).transform_filter(
    (alt.datum.MONTH == selection_month) | (selection_month == None)  # Filter by selected month
).transform_aggregate(
    count='count()',
    groupby=['VEHICLE TYPE', 'WEATHER']  # Group by vehicle type and weather condition
).encode(
    x=alt.X('VEHICLE TYPE:N', title='Vehicle type', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('count:Q', title='Total number of accidents'),
    tooltip=['VEHICLE TYPE:N', 'WEATHER:N', 'count:Q']
)

# Histogram with stacked bars and interactive highlighting
histogram_vehicles = base.mark_bar(opacity=0.8).encode(
    opacity=alt.condition(selection_vehicle | selection_weather, alt.value(1), alt.value(0.4))
).add_params(
    selection_month, selection_vehicle, selection_weather
)

# Add title and dimensions
vehicle_chart = histogram_vehicles.properties(
    title="Accident distribution by vehicle type and weather condition",
    width=700,
    height=400
)

#vehicle_chart

We can see a big difference among classes, which makes the chart difficult to analyze. We will instead plot the contributing factors and encode each vehicle type with color.

In [65]:
histogram = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        y=alt.Y('CONTRIBUTING FACTOR:N', sort='-x', axis=alt.Axis(title=None)),
        x=alt.X('count()', title='Accidents count'),
        color=alt.Color(
            'VEHICLE TYPE:N', 
            title='Vehicle Type',
            scale=alt.Scale(scheme='tableau10')),
        tooltip=['VEHICLE TYPE', 'count()']
    )
    .transform_filter(
        (alt.datum.MONTH == selection_month) | (selection_month == None)
    )
    .add_params(selection_month)
    .add_params(selection_vehicle)
    .encode(
        opacity=alt.condition(selection_vehicle, alt.value(1), alt.value(0.3))
    )
    .properties(
        title='Contributing factors',
        width=600,
        height=400
    )
)

#histogram

The difference among vehicle types is still quite remarkable, but with colors we get a better picture of their distribution, as well as their relation to the contributing factors. Regarding the latter, we can clearly see the difference between the most common contributing factor, improper driving and traffic rule violation, and the least common one, which includes uncontrolled vehicles.

## CHARTS COMBINATION

Now we want to join the selected charts into a final dashboard, including interactions among charts.

In [66]:
# Interactions
bar_chart = bar_chart.transform_filter(selection_vehicle & selection_borough & selection_weather & selection_point)
histogram = histogram.transform_filter(selection_vehicle & selection_borough & selection_weather & selection_point)
weather_chart = weather_chart.transform_filter(selection_vehicle & selection_borough & selection_weather & selection_point)
hours_line_chart = hours_line_chart.transform_filter(selection_vehicle & selection_borough  & selection_weather & selection_point)
heatmap_chart = heatmap_chart.transform_filter(selection_vehicle & selection_borough & selection_weather & selection_point)

# Combine the charts into the final layout
final_dashboard = alt.vconcat(
    alt.hconcat(final_map, bar_chart, histogram, spacing=10).resolve_scale(color='independent'),
    alt.hconcat(biplot, weather_chart, hours_line_chart, spacing=10).resolve_scale(color='independent'),
    alt.hconcat(heatmap_chart, spacing=10).resolve_scale(color='independent'),
    title="Traffic Accidents Dashboard - NYC"
).resolve_scale(
    color='independent',
    opacity='independent'
).configure_title(anchor='start')

In [69]:
# Display the dashboard
#final_dashboard

In [68]:
final_dashboard.save('traffic_accidents_dashboard.html')