# Seattle Traffic Collision Data
### Rebecca Stewart

## Analysis of numeric features that represent counts:

#### Features that represent number of people involved in the collision
* personcount - the total number of people involved
* pedcount - the total number of pedestrians involved
* pedcylcount - the total number of cyclists involved

#### Features that represent number of vehicles involved
* vehcount - the total number of vehicles involved

#### Features that represent number of injuries resulting from collision
* injuries - the total number of injuries other than fatal or disabling at the scene, including broken fingers or toes, abrasions, etc.
* seriousinjuries - total number of injuries that result in at least a temporary impairment, e.g. a broken limb. It does not mean that the collision resulted in a permanent disability
* fatalities - includes the total number of persons who died at the scene of the collisions, were dead on arrival at the hospital, or died within 30 days of the collision from collision-related injuries

#### Other interesting features that might relate to or give insight on these count features
* severitydesc - a description of the collision, e.g. Property Damage Only Collision, Injury Collision
* collisiontype - a description of the collision type, e.g. Parked Car, Rear Ended, Sideswipe
* sdot_coldesc - a description of the collision corresponding to the collision code (Motor Vehicle Struck Motor Vehicle, Rear End)
* st_coldesc - a description that corresponds to the state’s coding designation (Vehicle Going Straight Hits Pedestrian)
* hitparkedcar - whether the collision involved hitting a parked car (Y/N)
* crosswalkkey - a key for the crosswalk at which the collision occurred
* pedrownotgrnt - whether the pedestrian right of way was not granted (Y/N)

#### Some initial questions:
1. Does personcount include pedcount?
2. Does personcount include pedcylcount?
3. Does injuries include seriousinjuries?
4. Does injuries include fatalities?
5. If the collision is marked as hitparkedcar = Y, does the 'number of vehicles involved' include the parked car?
6. Is there a consistent pattern involving severitydesc and injuries/seriousinjuries/fatalities?
7. Is there a consistent pattern involving collisiontype and injuries/seriousinjuries/fatalities?
8. Are there only values for pedrownotgrnt if pedcount > 0 or pedcylcount > 0?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 500
pd.options.display.max_columns = 100
import warnings
warnings.filterwarnings("ignore") 
from datetime import datetime

from IPython.display import display, Markdown
# Display all output within each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

### Load, review and clean-up data

In [None]:
# READ IN THE DATA FROM THE LOCAL CSV FILE
local_file_name = "../data/collisions_orig.csv"
df = pd.read_csv(local_file_name, parse_dates=["INCDTTM"])

In [26]:
print("original df shape:", df.shape)

original df shape: (220436, 40)


In [46]:
# CHANGE CASE OF FEATURE NAMES  
df.columns = df.columns.str.lower()
df.head(3)

Unnamed: 0,x,y,objectid,inckey,coldetkey,reportno,status,addrtype,intkey,location,exceptrsncode,exceptrsndesc,severitycode,severitydesc,collisiontype,personcount,pedcount,pedcylcount,vehcount,injuries,seriousinjuries,fatalities,incdate,incdttm,junctiontype,sdot_colcode,sdot_coldesc,inattentionind,underinfl,weather,roadcond,lightcond,pedrownotgrnt,sdotcolnum,speeding,st_colcode,st_coldesc,seglanekey,crosswalkkey,hitparkedcar
0,-122.340472,47.608629,1,18600,18600,1785104,Matched,Intersection,29598.0,PIKE PL AND PIKE ST,,,2,Injury Collision,Pedestrian,2,1,0,1,1,0,0,2004/10/14 00:00:00+00,2004-10-14 18:36:00,At Intersection (intersection related),24.0,MOTOR VEHCILE STRUCK PEDESTRIAN,,0,Overcast,Dry,Dark - Street Lights On,,4288030.0,,3.0,Vehicle backing hits pedestrian,0,0,N
1,-122.251788,47.508176,2,328272,329772,EA07021,Unmatched,Block,,S PRENTICE ST BETWEEN 65TH AVE S AND 66TH AVE S,NEI,"Not Enough Information, or Insufficient Locati...",1,Property Damage Only Collision,,2,0,0,0,0,0,0,2020/01/22 00:00:00+00,2020-01-22 00:00:00,Mid-Block (not related to intersection),12.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE...",,,,,,,,,,,0,0,Y
2,-122.328526,47.70318,3,328374,329874,EA09347,Matched,Intersection,37555.0,1ST AVE NE AND NE 103RD ST,,,1,Property Damage Only Collision,Angles,4,0,0,2,0,0,0,2020/01/05 00:00:00+00,2020-01-05 13:28:00,At Intersection (intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,N,Raining,Wet,Daylight,,,,10.0,Entering at angle,0,0,N


#### For categorical features, let's go ahead and convert missing values and 'Other' to 'Unknown'

For pedrownotgrnt, let's convert null to N so that we have Y/N values.
For underinfl, we can consolidate the existing values (0 into N, 1 into Y) and mark nulls as X.

In [62]:
print(df["underinfl"].value_counts())

N    103000
0     81676
Y      5398
1      4230
Name: underinfl, dtype: int64


In [64]:
df["underinfl"] = df["underinfl"].replace({"0": "N", "1": "Y", np.nan: "X"})
print(df["underinfl"].value_counts())

N    184676
X     26132
Y      9628
Name: underinfl, dtype: int64


In [65]:
categories = ["addrtype", "collisiontype", "severitydesc", "lightcond",
             "speeding", "junctiontype", "roadcond", "weather"]

for col in categories:
    df[col] = df[col].replace({np.nan: "Unknown", "Other": "Unknown"})
    
df["pedrownotgrnt"] = df["pedrownotgrnt"].replace({np.nan: "N"})    
df["st_coldesc"] = df["st_coldesc"].replace({np.nan: "Unknown", "Other": "Unknown"})    

for col in categories+ ['pedrownotgrnt', 'st_coldesc', 'underinfl']:
    print("\n{}: {} unique and {} null".format(col,
                                               df[col].nunique(dropna=False),
                                               df[col].isna().sum())) 
    
 


addrtype: 4 unique and 0 null

collisiontype: 10 unique and 0 null

severitydesc: 5 unique and 0 null

lightcond: 8 unique and 0 null

speeding: 2 unique and 0 null

junctiontype: 7 unique and 0 null

roadcond: 8 unique and 0 null

weather: 10 unique and 0 null

pedrownotgrnt: 2 unique and 0 null

st_coldesc: 63 unique and 0 null

underinfl: 3 unique and 0 null


In [53]:
# LOOK AT THE FIRST TEN ROWS OF THE COLUMNS WE ARE INTERESTED IN
num_count_cols=['personcount','pedcount','pedcylcount','vehcount','injuries','seriousinjuries','fatalities']
interest_cols=['severitydesc','collisiontype','sdot_coldesc','st_coldesc','hitparkedcar','crosswalkkey','pedrownotgrnt']

display(df[num_count_cols].head(10))

Unnamed: 0,personcount,pedcount,pedcylcount,vehcount,injuries,seriousinjuries,fatalities
0,2,1,0,1,1,0,0
1,2,0,0,0,0,0,0
2,4,0,0,2,0,0,0
3,2,0,0,2,0,0,0
4,0,0,0,0,0,0,0
5,2,0,0,0,1,0,0
6,2,0,0,2,0,0,0
7,2,0,0,0,0,0,0
8,2,0,0,2,0,0,0
9,4,0,0,2,0,0,0


### Person Counts

In [54]:
df.groupby(['personcount','pedcount','pedcylcount']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,personcount,pedcount,pedcylcount,count
0,0,0,0,24510
1,0,0,1,187
2,0,0,2,2
3,0,1,0,216
4,0,1,1,1
5,0,2,0,11
6,0,2,1,1
7,1,0,0,13513
8,1,0,1,241
9,1,0,2,1


#### Answer to both questions 1 & 2 is no, they are not included
Many records have pedcount and/or pedcylcount greater than zero while personcount is zero, so personcount clearly does not automatically include pedestrians (pedcount) or cyclists (pedcylcount).
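
Rather than reading the grouped table by eye, the same check can be expressed as a count of rows that contradict the "included" hypothesis. The sketch below uses a small made-up frame (`toy`) in place of `df`; its values are illustrative only and are not taken from the collision data.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "personcount": [0, 0, 1, 2, 3],
    "pedcount":    [1, 0, 0, 1, 0],
    "pedcylcount": [0, 2, 1, 0, 0],
})

# If personcount included pedestrians and cyclists,
# it could never be smaller than their sum.
non_motorists = toy["pedcount"] + toy["pedcylcount"]
violations = toy[toy["personcount"] < non_motorists]

print(len(violations), "of", len(toy), "rows contradict the 'included' hypothesis")
```

A single violating row is enough to show that inclusion is not applied consistently; rows that satisfy the inequality do not prove inclusion on their own.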

#### Interesting Follow-up Question
There are 24,510 collisions where no people at all were involved. What kind of accidents would account for this?

### Analyze records where all person counts are zero

In [55]:
# Create a dataset of just those collisions where supposedly no people were involved
df_no_people = df.loc[(df["personcount"] == 0) & (df["pedcount"] == 0) & (df["pedcylcount"] == 0)]
df_no_people[['vehcount','injuries','seriousinjuries','fatalities', 'severitydesc','collisiontype','sdot_coldesc','st_coldesc','hitparkedcar','crosswalkkey','pedrownotgrnt']].head(5)

Unnamed: 0,vehcount,injuries,seriousinjuries,fatalities,severitydesc,collisiontype,sdot_coldesc,st_coldesc,hitparkedcar,crosswalkkey,pedrownotgrnt
4,0,0,0,0,Unknown,Unknown,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",Unknown,Y,0,N
12,0,0,0,0,Unknown,Unknown,NOT ENOUGH INFORMATION / NOT APPLICABLE,Unknown,N,0,N
15,0,0,0,0,Unknown,Unknown,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",Unknown,Y,0,N
25,0,0,0,0,Unknown,Unknown,NOT ENOUGH INFORMATION / NOT APPLICABLE,Unknown,Y,0,N
31,0,0,0,0,Unknown,Unknown,NOT ENOUGH INFORMATION / NOT APPLICABLE,Unknown,Y,0,N


#### Something strange is going on here...
These look like real collisions involving vehicles, yet the person, vehicle and injury counts in the sample above are all zero. This subset is 24,510 of 220,436 records, about 11% of our data. We need to keep this in mind whenever an analysis uses vehcount and/or the person counts, and we might want to drop records that are zero for all of these counts.
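
If we do decide to drop the all-zero records, one possible approach is a row-wise mask over the count columns. This is a sketch on a made-up frame (`toy`), not the actual data; the name `toy_with_counts` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "personcount":     [0, 2, 0, 0],
    "pedcount":        [0, 0, 0, 1],
    "pedcylcount":     [0, 0, 0, 0],
    "vehcount":        [0, 2, 0, 0],
    "injuries":        [0, 0, 1, 0],
    "seriousinjuries": [0, 0, 0, 0],
    "fatalities":      [0, 0, 0, 0],
})

count_cols = ["personcount", "pedcount", "pedcylcount", "vehcount",
              "injuries", "seriousinjuries", "fatalities"]

# True for rows in which every count column is zero.
all_zero = toy[count_cols].eq(0).all(axis=1)

# Keep only rows that carry at least one non-zero count.
toy_with_counts = toy.loc[~all_zero]
print("dropped", int(all_zero.sum()), "of", len(toy), "rows")
```

Note that a row with zero people but a recorded injury is kept by this rule, since it still carries some count information.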

#### A further look at this subset of data:
How can there be no people involved if there are vehicles and/or injuries involved?

In [56]:
print("Number of records where zero, one or more vehicles were involved when no people were involved")
df_no_people['vehcount'].value_counts()

Number of records where zero, one or more vehicles were involved when no people were involved


0     19352
2      4266
3       391
1       391
4        81
5        19
6         6
7         2
11        1
9         1
Name: vehcount, dtype: int64

In [57]:
print("Number of records where there were zero, one or more injuries when no people were involved")
df_no_people['injuries'].value_counts()

Number of records where there were zero, one or more injuries when no people were involved


0    23092
1     1052
2      269
3       64
4       21
5        9
9        1
8        1
6        1
Name: injuries, dtype: int64

In [58]:
print("Severity of collisions for those when supposedly no people were involved - most of these are listed as Unknown")
df_no_people['severitydesc'].value_counts()

Severity of collisions for those when supposedly no people were involved - most of these are listed as Unknown


Unknown                           19349
Property Damage Only Collision     3741
Injury Collision                   1400
Serious Injury Collision             19
Fatality Collision                    1
Name: severitydesc, dtype: int64

In [59]:
print("Collisiontype for collisions for those when supposedly no people were involved - most of these are listed as Unknown")
df_no_people['collisiontype'].value_counts()


Collisiontype for collisions for those when supposedly no people were involved - most of these are listed as Unknown


Unknown       19935
Angles         1223
Parked Car     1111
Rear Ended     1038
Sideswipe       614
Left Turn       432
Right Turn       94
Head On          63
Name: collisiontype, dtype: int64

In [60]:
print("Descriptions for collisions where supposedly no people were involved - the most common value is NOT ENOUGH INFORMATION / NOT APPLICABLE")
df_no_people['sdot_coldesc'].value_counts()


Descriptions for collisions where supposedly no people were involved - the most common value is NOT ENOUGH INFORMATION / NOT APPLICABLE


NOT ENOUGH INFORMATION / NOT APPLICABLE                         8265
MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE          7338
MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END                    5682
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE         1021
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE AT ANGLE           626
MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT                    490
MOTOR VEHICLE STRUCK OBJECT IN ROAD                              321
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE         209
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE          180
MOTOR VEHCILE STRUCK PEDESTRIAN                                  160
MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE             70
MOTOR VEHICLE OVERTURNED IN ROAD                                  38
PEDALCYCLIST STRUCK MOTOR VEHICLE FRONT END AT ANGLE              26
DRIVERLESS VEHICLE RAN OFF ROAD - HIT FIXED OBJECT                12
MOTOR VEHICLE STRUCK TRAIN        

In [61]:
print("hitparkedcar for collisions where supposedly no people were involved")
df_no_people['hitparkedcar'].value_counts()

hitparkedcar for collisions where supposedly no people were involved


N    20719
Y     3791
Name: hitparkedcar, dtype: int64

### Vehicle Counts

In [67]:
df.groupby(['vehcount','personcount']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,vehcount,personcount,count
0,0,0,19357
1,0,1,1816
2,0,2,4666
3,0,3,359
4,0,4,177
5,1,0,790
6,1,1,11831
7,1,2,12221
8,1,3,2061
9,1,4,538


#### Analysis of Vehicle Count = 0

Since we have already looked at the records where person count and vehicle count are both zero, let's look only at those where person count is greater than zero.

In [68]:
df_no_vehicle = df.loc[(df["vehcount"] == 0) & (df["personcount"]> 0) ]

In [72]:
print("collisiontype for collisions for those when supposedly no vehicles were involved")
df_no_vehicle['collisiontype'].value_counts()

collisiontype for collisions for those when supposedly no vehicles were involved


Unknown       6781
Cycles         235
Pedestrian       2
Name: collisiontype, dtype: int64

I'm not surprised that most of these are marked as Unknown. That may be why the number of vehicles is 0: the count simply wasn't known, and zero was recorded instead of 'unknown'.

In [70]:
print("severitydesc for collisions for those when supposedly no vehicles were involved")
df_no_vehicle['severitydesc'].value_counts()

severitydesc for collisions for those when supposedly no vehicles were involved


Property Damage Only Collision    3630
Unknown                           2160
Injury Collision                  1134
Serious Injury Collision            85
Fatality Collision                   9
Name: severitydesc, dtype: int64

I would guess that the 'Property Damage Only Collision' and 'Unknown' records are collisions that were reported after the fact, which probably means certain data, such as the number of vehicles involved, was never collected. Again, the count appears to default to zero instead of 'unknown'.
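
If zero really is standing in for 'unknown', one option is to make that explicit by turning those zeros into missing values. The sketch below assumes that a zero vehcount on a collision whose type is 'Unknown' means "not recorded". That rule is a guess, the frame `toy` is made up, and `vehcount_clean` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "vehcount":      [0, 0, 2, 1],
    "collisiontype": ["Unknown", "Cycles", "Angles", "Unknown"],
})

# Assumption: zero vehicles on an 'Unknown' collision type means "not recorded".
not_recorded = (toy["vehcount"] == 0) & (toy["collisiontype"] == "Unknown")

# The nullable Int64 dtype keeps the column integer-typed while allowing <NA>.
toy["vehcount_clean"] = toy["vehcount"].astype("Int64").mask(not_recorded)
print(toy)
```

Keeping the original column alongside the cleaned one makes it easy to compare results with and without the assumption.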

In [73]:
print("hitparkedcar for collisions for those when supposedly no vehicles were involved")
df_no_vehicle['hitparkedcar'].value_counts()

hitparkedcar for collisions for those when supposedly no vehicles were involved


N    5636
Y    1382
Name: hitparkedcar, dtype: int64

Collision records where a parked car was hit should clearly have a vehicle count of at least 2, or at least 1 if it was a bicycle that hit the parked car.

In [74]:
df_no_vehicle['sdot_coldesc'].value_counts()

NOT ENOUGH INFORMATION / NOT APPLICABLE                                 2017
MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE                  1936
MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END                            1585
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE AT ANGLE                   507
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE                  173
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE                  147
MOTOR VEHCILE STRUCK PEDESTRIAN                                          124
MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT                             96
MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE                     86
PEDALCYCLIST STRUCK PEDESTRIAN                                            81
MOTOR VEHICLE STRUCK OBJECT IN ROAD                                       81
PEDALCYCLIST OVERTURNED IN ROAD                                           62
MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE                  41

We could use this information to update the vehicle count to some number other than zero (except for the NEI records).
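
A minimal sketch of that idea follows, assuming a simple prefix rule on sdot_coldesc. The helper `implied_min_vehicles`, the column `vehcount_est` and the frame `toy` are all hypothetical, and the rule itself is only one possible reading of the descriptions.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "vehcount": [0, 0, 0, 3],
    "sdot_coldesc": [
        "MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",
        "MOTOR VEHICLE STRUCK OBJECT IN ROAD",
        "NOT ENOUGH INFORMATION / NOT APPLICABLE",
        "MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",
    ],
})

def implied_min_vehicles(desc: str) -> int:
    """Hypothetical rule: smallest vehicle count the description implies."""
    if desc.startswith("MOTOR VEHICLE STRUCK MOTOR VEHICLE"):
        return 2
    if desc.startswith("MOTOR VEHICLE"):
        return 1
    return 0  # NEI and anything unrecognised: no inference is made

implied = toy["sdot_coldesc"].map(implied_min_vehicles)

# Only fill in vehcount where it is currently zero; never change a recorded value.
toy["vehcount_est"] = toy["vehcount"].where(toy["vehcount"] > 0, implied)
print(toy[["vehcount", "vehcount_est"]])
```

The data spells one description "MOTOR VEHCILE STRUCK PEDESTRIAN", so a prefix rule like this one would need to handle that spelling separately.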

### Injury Counts

In [66]:
df.groupby(['injuries','seriousinjuries','fatalities']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,injuries,seriousinjuries,fatalities,count
0,0,0,0,158536
1,0,0,1,220
2,0,0,2,4
3,0,0,3,1
4,0,0,4,1
5,0,1,0,6
6,1,0,0,44736
7,1,0,1,49
8,1,0,2,1
9,1,1,0,2236


#### Answer to both questions 3 & 4 is no, they are not included

Since there are records where injuries are zero and serious injuries and/or fatalities are greater than zero, we can assume that injuries does not include those that are serious or fatal.

### Other interesting questions

In [20]:
display(df[interest_cols].head(10))

Unnamed: 0,severitydesc,collisiontype,sdot_coldesc,st_coldesc,hitparkedcar,crosswalkkey,pedrownotgrnt
0,Injury Collision,Pedestrian,MOTOR VEHCILE STRUCK PEDESTRIAN,Vehicle backing hits pedestrian,N,0,
1,Property Damage Only Collision,,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE...",,Y,0,
2,Property Damage Only Collision,Angles,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",Entering at angle,N,0,
3,Property Damage Only Collision,Sideswipe,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",From same direction - both going straight - bo...,N,0,
4,Unknown,,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,Y,0,
5,Injury Collision,,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,N,0,
6,Property Damage Only Collision,Other,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",From same direction - all others,N,0,
7,Property Damage Only Collision,,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,Y,0,
8,Property Damage Only Collision,Other,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",One car entering driveway access,N,0,
9,Property Damage Only Collision,Left Turn,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",From opposite direction - one left turn - one ...,N,0,


If the collision is marked as hitparkedcar = Y, does the 'number of vehicles involved' include the parked car?

Let's only look at those where number of vehicles is greater than zero

In [83]:
pd.options.display.max_colwidth = 150
# Keep only parked-car collisions with at least one vehicle recorded
df_hit_parked_car = df.loc[(df["hitparkedcar"] == "Y") & (df["vehcount"] > 0)]
df_hit_parked_car.groupby(['vehcount','sdot_coldesc']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,vehcount,sdot_coldesc,count
0,1,DRIVERLESS VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,1
1,1,MOTOR VEHCILE STRUCK PEDESTRIAN,1
2,1,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,2
3,1,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",4
4,1,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE",2
5,1,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",15
6,1,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE",1
7,1,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE",1
8,1,MOTOR VEHICLE STRUCK OBJECT IN ROAD,1
9,1,"MOTOR VEHICLE STRUCK PEDALCYCLIST, REAR END",1


We can definitely see some inconsistencies where the vehicle count is one. Only some of these make sense, such as the ones where a cyclist runs into a vehicle. Most of the others (even the ones where a driverless vehicle struck another vehicle) seem like they should involve more than one vehicle.

#### Answer to question 5: yes, it’s included.

Most of these records have a vehicle count greater than one, which leads me to believe that the parked car is almost always included in the vehicle count.
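
To put a number on "most", one could compute the share of parked-car collisions that record two or more vehicles. The following is a sketch on a made-up frame (`toy`), so the percentage it prints says nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "hitparkedcar": ["Y", "Y", "Y", "Y", "N", "Y"],
    "vehcount":     [2,   3,   1,   2,   1,   0],
})

# Same filter as above: parked-car collisions with at least one vehicle recorded.
parked = toy[(toy["hitparkedcar"] == "Y") & (toy["vehcount"] > 0)]

# The mean of a boolean Series is the fraction of True values.
share_two_plus = (parked["vehcount"] >= 2).mean()
print(f"{share_two_plus:.0%} of parked-car collisions record two or more vehicles")
```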


#### Answer to question 6:

Is there a consistent pattern involving severitydesc and injuries/seriousinjuries/fatalities?
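
One possible way to approach this question is to label each record by its most severe non-zero count and cross-tabulate that label against severitydesc. The sketch below uses a made-up frame (`toy`); the helper `worst_outcome` and the column `worst` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "severitydesc": ["Property Damage Only Collision", "Injury Collision",
                     "Serious Injury Collision", "Fatality Collision",
                     "Injury Collision"],
    "injuries":        [0, 1, 0, 0, 0],
    "seriousinjuries": [0, 0, 1, 0, 0],
    "fatalities":      [0, 0, 0, 1, 0],
})

def worst_outcome(row) -> str:
    """Label the most severe non-zero count in a row (hypothetical helper)."""
    if row["fatalities"] > 0:
        return "fatal"
    if row["seriousinjuries"] > 0:
        return "serious"
    if row["injuries"] > 0:
        return "injury"
    return "none"

toy["worst"] = toy.apply(worst_outcome, axis=1)

# Off-pattern cells (e.g. 'Injury Collision' with worst == 'none') flag inconsistencies.
table = pd.crosstab(toy["severitydesc"], toy["worst"])
print(table)
```

If the pattern were fully consistent, each severitydesc row would have counts in exactly one column.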

#### Answer to question 7:

Is there a consistent pattern involving collisiontype and injuries/seriousinjuries/fatalities?
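
A starting point here might be the share of collisions of each type that record at least one injury, serious injury or fatality. This is a sketch on a made-up frame (`toy`), so the rates it prints are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "collisiontype": ["Pedestrian", "Pedestrian", "Parked Car",
                      "Parked Car", "Rear Ended"],
    "injuries":        [1, 2, 0, 0, 1],
    "seriousinjuries": [1, 0, 0, 0, 0],
    "fatalities":      [0, 0, 0, 0, 0],
})

injury_cols = ["injuries", "seriousinjuries", "fatalities"]

# 1 where a record has at least one of the outcome, 0 otherwise.
has_outcome = toy[injury_cols].gt(0).astype(int)

# Per collision type, the fraction of records with each outcome.
rates = has_outcome.groupby(toy["collisiontype"]).mean()
print(rates)
```

Using the share of records rather than the raw totals keeps large categories from dominating the comparison.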

#### Answer to question 8:

Are there only values for pedrownotgrnt if pedcount > 0 or pedcylcount > 0?
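
Because nulls in pedrownotgrnt were converted to N earlier, "has a value" here effectively means Y. One way to check the question is to look for Y records with no pedestrian or cyclist recorded. The sketch below uses a made-up frame (`toy`), and the label `ped_or_cyclist` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "pedrownotgrnt": ["Y", "N", "Y", "N"],
    "pedcount":      [1, 0, 0, 1],
    "pedcylcount":   [0, 0, 0, 0],
})

has_non_motorist = (toy["pedcount"] > 0) | (toy["pedcylcount"] > 0)

# String labels make the cross-tab columns easier to read than True/False.
ped_or_cyclist = has_non_motorist.map({True: "yes", False: "no"}).rename("ped_or_cyclist")
table = pd.crosstab(toy["pedrownotgrnt"], ped_or_cyclist)
print(table)

# Rows flagged Y with no pedestrian or cyclist recorded contradict the hypothesis.
unexpected = toy[(toy["pedrownotgrnt"] == "Y") & ~has_non_motorist]
print(len(unexpected), "unexpected rows")
```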


### Conclusions

Describe conclusions here