# Data Report
This document records the data our team will use throughout this project, including the structure and statistics of the raw data. It addresses questions such as how the data will be used to meet the business objectives.

It also lists any outstanding questions and scope issues related to the data.



## Sources

### Seattle Traffic Collisions Dataset
This dataset, maintained by the City of Seattle, contains traffic collision report details from 2004 to the present. It will be our main data source for collision data.

* https://data-seattlecitygis.opendata.arcgis.com/datasets/collisions
* https://www.kaggle.com/jonleon/seattle-sdot-collisions-data/data
* https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf

The data source itself appears mostly clean, although several questions remain for which we have both reached out for answers and made interim assumptions.
Together with other teams, we have compiled a list of questions, which can be found [here](https://docs.google.com/document/d/1did1D2Ls1a_kntCfiArpjCI5PuB15xR-uHPdWUXX6PI/edit).
Some of the ambiguities are:
1. REPORTNO: Why do some report numbers start with ‘E’, others with ‘EA’, others with ‘C’, and others with no prefix letter?
1. EXCEPTRSNCODE: What does this field mean?
1. SEVERITYCODE: How are these values computed? Codes 3 and 2 are simple enough; however, some rows coded 2b have no injuries. How can this be? Also, what logic determines the 0 and 1 codes? About 90% of the 0/1 cases are clearly separable, but thousands of confusing rows remain. For instance, a collision in which a driver under the influence hits a parked car with no injuries is marked as unknown. Why is it not marked as property damage?
1. UNDERINFL: Why did the data collection system change from binary (0/1) to Y/N in 2009/2010? Can we assume that 1≡Y and 0≡N?
1. STATUS/WEATHER/ROADCOND/LIGHTCOND: Why are WEATHER, ROADCOND, and LIGHTCOND also missing when STATUS is Unmatched?
1. STATUS: What do Matched and Unmatched mean?
1. SPEEDING: Can it safely be assumed that all missing values mean the driver was not speeding? In what cases should they be kept missing? For example, since Unmatched is correlated with missing values in WEATHER, ROADCOND, and LIGHTCOND, should SPEEDING be treated as missing for Unmatched rows as well?
1. The data appears to have been merged from separate systems over time. The existence of both SDOT_COLCODE and ST_COLCODE also suggests that two systems may currently be in use. Do collision reports presently come through two systems, i.e. the state versus the Seattle Department of Transportation?
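
Two of the questions above lend themselves to a quick check in code. The following is a minimal sketch, not our actual pipeline: it assumes the unconfirmed mapping 1≡Y and 0≡N for UNDERINFL, and it measures how often a field is missing per STATUS value. The function names and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Assumed mapping: the equivalence 1≡Y / 0≡N is still an open question above.
UNDERINFL_MAP = {"1": True, "Y": True, "0": False, "N": False}


def normalize_underinfl(value):
    """Map a raw UNDERINFL value to True/False, or None if missing or unrecognized."""
    if value is None:
        return None
    return UNDERINFL_MAP.get(str(value).strip().upper())


def missing_rate_by_status(rows, field):
    """Return {status: fraction of rows in which `field` is missing or empty}."""
    totals, missing = Counter(), Counter()
    for row in rows:
        status = row.get("status")
        totals[status] += 1
        if row.get(field) in (None, ""):
            missing[status] += 1
    return {status: missing[status] / totals[status] for status in totals}


# Toy rows for illustration only; they are not taken from the real dataset.
sample = [
    {"status": "Matched", "weather": "Rain", "underinfl": "N"},
    {"status": "Matched", "weather": "Clear", "underinfl": "1"},
    {"status": "Unmatched", "weather": "", "underinfl": None},
]

print([normalize_underinfl(r["underinfl"]) for r in sample])
print(missing_rate_by_status(sample, "weather"))
```

Unrecognized values are mapped to `None` rather than guessed, so any encoding we have not anticipated stays visible as missing instead of being silently counted as "not under the influence".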

Here is a full list of the features:
* x
* y
* status
* addrtype
* intkey
* location
* exceptrsncode
* severitycode
* collisiontype
* personcount
* pedcount
* pedcylcount
* vehcount
* injuries
* seriousinjuries
* fatalities
* incdttm
* junctiontype
* sdot_colcode
* inattentionind
* underinfl
* weather
* roadcond
* lightcond
* pedrownotgrnt
* speeding
* st_colcode
* seglanekey
* crosswalkkey
* hitparkedcar

### Seattle Streets Dataset
* https://data-seattlecitygis.opendata.arcgis.com/datasets/seattle-streets/
* https://www.seattle.gov/Documents/Departments/SDOT/GIS/Seattle_Streets_OD.pdf


For collisions whose ADDRTYPE is Block, the location field value equals the UNITDESC field value of the road segment. Once paired, we can see many additional attributes of the road on which a collision occurred.

Here is a full list of features in this dataset:

* artclass
* compkey
* unitid
* unitid2
* unitidsort
* stname_ord
* xstrlo
* xstrhi
* artdescript
* owner
* status
* blocknbr
* speedlimit
* segdir
* oneway
* onewaydir
* flow
* seglength
* surfacewidth
* surfacetype_1
* surfacetype_2
* intrlo
* dirlo
* intkeylo
* intrhi
* dirhi
* nationhwysys
* streettype
* pvmtcondindx1
* pvmtcondindx2
* tranclass
* trandescript
* slope_pct
* pvmtcategory
* parkboulevard
* shape_length
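
The pairing of Block collisions with road segments described above can be sketched as a simple lookup join. This is an illustrative sketch under stated assumptions: the helper name `attach_street_attributes` is hypothetical, the case-insensitive, whitespace-trimmed comparison is our assumption about how the two text fields line up, and `unitdesc` is assumed to be unique per segment. The sample rows are made up.

```python
def attach_street_attributes(collisions, streets, fields=("speedlimit", "slope_pct")):
    """Copy selected street fields onto Block collisions whose location matches a segment's unitdesc."""
    # Assumption: unitdesc is unique per segment; if it is not, the last segment wins.
    by_unitdesc = {
        s["unitdesc"].strip().upper(): s for s in streets if s.get("unitdesc")
    }
    joined = []
    for collision in collisions:
        row = dict(collision)
        key = (collision.get("location") or "").strip().upper()
        # Only Block collisions are expected to match a road segment.
        segment = by_unitdesc.get(key) if collision.get("addrtype") == "Block" else None
        for field in fields:
            row[field] = segment.get(field) if segment else None
        joined.append(row)
    return joined


# Made-up rows for illustration only.
streets = [
    {"unitdesc": "EXAMPLE AVE BETWEEN 1ST ST AND 2ND ST", "speedlimit": 25, "slope_pct": 3},
]
collisions = [
    {"addrtype": "Block", "location": "Example Ave between 1st St and 2nd St"},
    {"addrtype": "Intersection", "location": "EXAMPLE AVE AND 1ST ST"},
]

for row in attach_street_attributes(collisions, streets):
    print(row["addrtype"], row["speedlimit"], row["slope_pct"])
```

Collisions that do not match any segment keep `None` for the street fields, which makes the unmatched share easy to count before deciding how to handle it.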


### Seattle Intersections Dataset
* http://data-seattlecitygis.opendata.arcgis.com/datasets/intersections
* https://www.seattle.gov/Documents/Departments/SDOT/GIS/Intersections_OD.pdf

For collisions whose ADDRTYPE is Intersection, the location field value equals the UNITDESC field value in the intersection data.

We are not currently aware of any useful signal in this dataset, but we have included it for completeness.

Here is a full list of attributes:
* x
* y
* intr_id
* gis_xcoord
* gis_ycoord
* compkey
* comptype
* unitid
* subarea
* arterialclasscd
* signal_maint_dist
* signal_type
* shape_lng
* shape_lat

Intersection data question:
1. Why do intersections exist in the system outside of and separately from road segments? Why do intersections lack many data fields that road segments have? For example, speed limit exists only for road segments, and slope percent, flow, surface type, and surface width do not exist for intersections either. Can these values safely be imputed as, say, the min/max/average of the roads that meet at an intersection?

Regarding collisions that occur at intersections, many fields from the street data are inherently missing, because the street data corresponds directly by key only to collisions that occur on road segments. Intersections lie between road segments, so if a value can sensibly be assumed to exist at an intersection as well, it must be imputed.
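
One possible way to impute such values is to aggregate a street field over every road segment that touches an intersection. The sketch below is only an illustration of that idea. It assumes each segment carries the intersection keys at both of its ends: `intkeylo` appears in the street feature list above, while `intkeyhi` is an assumed counterpart that is not listed. Whether min, max, or average is the right aggregate remains an open question, so the aggregate is a parameter.

```python
from collections import defaultdict
from statistics import mean


def impute_intersection_values(segments, field, agg=max):
    """Aggregate `field` over all segments touching each intersection key."""
    values = defaultdict(list)
    for segment in segments:
        value = segment.get(field)
        if value is None:
            continue  # a segment without the field contributes nothing
        # `intkeyhi` is an assumed field name; only `intkeylo` is listed above.
        for key in (segment.get("intkeylo"), segment.get("intkeyhi")):
            if key is not None:
                values[key].append(value)
    return {key: agg(vals) for key, vals in values.items()}


# Made-up segments for illustration only.
segments = [
    {"intkeylo": 1, "intkeyhi": 2, "speedlimit": 25},
    {"intkeylo": 2, "intkeyhi": 3, "speedlimit": 35},
]

print(impute_intersection_values(segments, "speedlimit", agg=max))
print(impute_intersection_values(segments, "speedlimit", agg=mean))
```

The resulting dictionary can then be looked up by a collision's `intkey` to fill in the missing street attribute for Intersection collisions.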