In [1]:
import pandas as pd
import altair as alt

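# Folder holding the cleaned CSV files (named data_dir so the built-in dir() is not shadowed)
data_dir = '../Data'

weather = pd.read_csv(f'{data_dir}/weather_clean.csv')
collisions = pd.read_csv(f'{data_dir}/collisions_clean.csv')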

alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

In [2]:
colors = ({
    'sequential': ['#5803AD', '#7FDFA1'],
    'categorical': []
})

In [3]:
# select the year from the CRASH DATE column
collisions['YEAR'] = collisions['CRASH DATE'].astype(str).str[:4]
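
String slicing works as long as every `CRASH DATE` value starts with a four-digit year, as in `2020-09-06 00:00:00+00:00`. As a minimal alternative sketch, assuming the column holds ISO-style timestamps, the dates can be parsed first so that malformed values show up as missing instead of as wrong years. The `sample` frame below is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical sample mimicking the CRASH DATE format in collisions_clean.csv
sample = pd.DataFrame({
    'CRASH DATE': ['2018-07-01 00:00:00+00:00', '2020-09-06 00:00:00+00:00']
})

# Parse to timezone-aware datetimes; unparseable values become NaT
parsed = pd.to_datetime(sample['CRASH DATE'], utc=True, errors='coerce')

# Format the year as a string so it matches the string-sliced version
sample['YEAR'] = parsed.dt.strftime('%Y')
```

Keeping `YEAR` as a string matters here because the color scale in the charts uses the string domain `['2018', '2020']`.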

## At what time of the day are accidents more common?

Because the collisions are aggregated by hour of the day, a line chart is a natural way to show how the total number of crashes changes over the day. It also makes it easy to split the data by year (2018 and 2020), which adds context at little cost. The chart draws each year as an overlaid area with a line and points. The result clearly shows the hours with the most crashes and allows an easy comparison between hours and between years.
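
The aggregation behind this chart can also be checked directly in pandas. The sketch below uses a small made-up `sample` frame (not the real data) with the same column names, counts collisions per year and hour, and then picks the busiest hour for each year.

```python
import pandas as pd

# Hypothetical rows with the same columns the chart uses
sample = pd.DataFrame({
    'COLLISION_ID': [1, 2, 3, 4, 5, 6],
    'CRASH TIME INTERVAL': [8, 8, 17, 8, 17, 17],
    'YEAR': ['2018', '2018', '2018', '2020', '2020', '2020'],
})

# Count collisions per (year, hour), mirroring a count() aggregate
# grouped by YEAR and CRASH TIME INTERVAL
hourly = (
    sample.groupby(['YEAR', 'CRASH TIME INTERVAL'])
          .size()
          .reset_index(name='count')
)

# Row with the highest count within each year
peak = hourly.loc[hourly.groupby('YEAR')['count'].idxmax()]
```

Running the same two steps on `collisions` would give the numeric values behind the peaks that are visible in the chart.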

In [5]:
df = collisions[['COLLISION_ID', 'CRASH TIME INTERVAL', 'YEAR']]

alt.Chart(df).mark_area(
    point=True,
    fillOpacity=0.8,
    line=True,
    tooltip=True
    # interpolate='monotone'
).transform_aggregate(
    count='count()',
    groupby=['YEAR', 'CRASH TIME INTERVAL']
).encode(
    x=alt.X('CRASH TIME INTERVAL:O', title='Hour of the Day', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('count:Q', title='Number of Collisions', stack=None),
    color=alt.Color('YEAR:N', title='Year', scale=alt.Scale(domain=['2018', '2020'], range=['lightblue', 'lightgreen'])),
    tooltip=['count:Q', 'CRASH TIME INTERVAL:O']
).properties(
    title='Number of Collisions by Hour of the Day'
)

Another option we considered is a heatmap. Here we add the day of the week to the hour of the day, so the visualization shows not only at what time of day crashes are most frequent but also whether they are more frequent on weekends or on weekdays, which adds detail and context to the first question. We find the result useful and clear. Its one limitation is that it does not allow a comparison between years; this is a minor issue because, as mentioned, a separate visualization covers that task.
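
The grid that the heatmap encodes can be sketched in pandas as a cross-tabulation of day against hour. The example below uses a made-up `sample` frame, not the real data, and also shows one possible way to compare weekdays with weekends. Dividing by the number of days in each group is our assumption about what a fair comparison looks like, since weekdays cover five days and weekends only two.

```python
import pandas as pd

# Hypothetical rows with the same columns the heatmap uses
sample = pd.DataFrame({
    'DAY NAME': ['Monday', 'Monday', 'Saturday', 'Sunday', 'Friday'],
    'CRASH TIME INTERVAL': [8, 8, 23, 23, 17],
})

# Rows = day of the week, columns = hour, values = number of collisions
grid = pd.crosstab(sample['DAY NAME'], sample['CRASH TIME INTERVAL'])

# Average collisions per day of the week in each group
# (5 weekday names vs. 2 weekend names)
is_weekend = sample['DAY NAME'].isin(['Saturday', 'Sunday'])
per_day = {
    'weekday': (~is_weekend).sum() / 5,
    'weekend': is_weekend.sum() / 2,
}
```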

In [6]:
collisions.head(1)

Unnamed: 0,COLLISION_ID,DAY NAME,CRASH DATE,CRASH MOMENT,CRASH TIME INTERVAL,BOROUGH,ZIP CODE,STREET NAME,NUMBER OF INJURED,NUMBER OF KILLED,CONTRIBUTING FACTOR VEHICLE 1,VEHICLE TYPE CODE 1,LATITUDE,LONGITUDE,YEAR
0,4345591,Sunday,2020-09-06 00:00:00+00:00,Night,18,QUEENS,,WHITESTONE EXPRESSWAY,0.0,2.0,Unsafe Lane Changing,VAN,40.771038,-73.83413,2020


In [7]:
c1 = alt.Chart(collisions[['CRASH TIME INTERVAL', 'DAY NAME', 'YEAR']]).mark_rect(
    tooltip=True
).encode(
    x=alt.X('CRASH TIME INTERVAL:N',
            title='Hour of the Day',
            axis=alt.Axis(labelAngle=0)),
    y=alt.Y('DAY NAME:N',
            sort=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
            title='Day of the Week'),
    color=alt.Color('count():Q',
                    title='Number of Collisions',
                    scale=alt.Scale(range=['#f0fff1', '#5603ad'])),
    tooltip=['count()']
).properties(
    # The data is not filtered by year, so the title does not name one
    title='Number of Collisions by Hour of the Day and Day of the Week'
)

c1

While the line chart is a good option, we concluded that the heatmap is the best choice for this question, mainly because it adds a layer of detail that complements the answer to the first question. The chosen color palette, which runs from a light tint to a dark purple, also lets people with color blindness read the chart without difficulty.