# Exploring the data

Before doing anything else, we need to understand how the data is structured.

This file experiments with the data structure and investigates all the features.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import sys

ROOT_DIR = '../'
sys.path.insert(1, '../production_code/')
from constants import *

2000_to_2005 is a whole separate dataset with the same columns; might ignore it for now and filter by dates.
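
One possible reading of "filter by dates" is sketched below: parse `ACCIDENTDATE` and keep only rows on or after a cut-off. The toy rows, the day/month/year string format, and the `CUTOFF` value are all assumptions for illustration, not taken from the real files.

```python
import pandas as pd

# Toy stand-in for ACCIDENT.csv; the real date format is an assumption here.
accidents = pd.DataFrame({
    "ACCIDENT_NO": ["T2005001", "T2005002", "T2006001", "T2007001"],
    "ACCIDENTDATE": ["30/12/2005", "31/12/2005", "01/01/2006", "15/06/2007"],
})

# Parse the strings once so comparisons are chronological, not alphabetical.
accidents["ACCIDENTDATE"] = pd.to_datetime(
    accidents["ACCIDENTDATE"], format="%d/%m/%Y"
)

# Hypothetical cut-off: drop everything covered by the 2000_to_2005 dataset.
CUTOFF = pd.Timestamp("2006-01-01")
recent = accidents[accidents["ACCIDENTDATE"] >= CUTOFF]

print(recent["ACCIDENT_NO"].tolist())
```

If the real column uses a different format, only the `format=` string should need changing.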

## from 


| csv name | from documentation | from experimentation |
| ---| --- | --- |
| ACCIDENT_CHAINAGE |  | more specific location data |
| ACCIDENT_EVENT | accident_event (sequence of events e.g. left road, rollover, caught fire) | + location of impact on car |
| ACCIDENT_LOCATION |  | geographical location (street names no GPS) |
| ACCIDENT | accident (basic accident details, time, severity, location) | Main dataframe, general info |
| ATMOSPHERIC_COND | atmospheric_cond (rain, winds etc) | |
| NODE_ID_COMPLEX_INT_ID |  |  for breaking down complex intersections |
| NODE | Node Table with Lat/Long references | Lat/long and LGA areas |
| PERSON | person (person based details, age, sex etc) | |
| ROAD_SURFACE_COND | road_surface_cond (whether road was wet, dry, icy etc)  | |
| Statistic Checks |  | some general stats |
| SUBDCA | sub_dca (detailed codes describing accident) | eg vehicle entering or leaving intersection etc.. |
| VEHICLE | vehicle (vehicle based data, vehicle type, make etc) | |

## from documentation

- accident_node (master location table - NB subset of accident table) 


## features of interest

| Feature | description | dataset |
| ---- | ---- | ---- |
| ACCIDENT_NO | individual accident id, one id for each accident | all |
| NODE_ID | locational node where the incident occurred, 0 is unable to locate | NODE |
| ACCIDENTDATE | date | ACCIDENT |
| ACCIDENTTIME | time | ACCIDENT |
| ATMOSPH_COND | int 0-9, general weather data, check crash stats appendices | ATMOSPHERIC_COND |
| ATMOSPH_COND_SEQ | int 0-4, not sure, ignoring | ATMOSPHERIC_COND |
| LIGHT_CONDITION | | ACCIDENT |
| ACCIDENT_TYPE | type of collision, hit animal, ped etc... | ACCIDENT |
| SEVERITY | non injury (0) - fatal (died in 30 days) (1) | ACCIDENT |
| ROAD_GEOMETRY | Cross intersection - not at an intersection | ACCIDENT |
| POLICE_ATTEND | whether police attended or not | ACCIDENT |
| VEHICLE_ID | information of a vehicle | ACCIDENT_EVENT |
| ATMOSPH_COND_SEQ | not sure, if the weather changes maybe? | ATMOSPHERIC_COND |
| COMPLEX_INT_NO | for breaking down complex intersections into sub chunks | NODE_ID_COMPLEX_INT_ID |
| LGA_NAME | local government area | NODE |
| REGION_NAME | more granular area if needed | NODE |
| NODE_TYPE | type of node point (intersection, non intersection etc...) | NODE |
| TAKEN_HOSPITAL | whether a person was taken to hospital | PERSON |
| ROAD_USER_TYPE | where the person was in relation to the accident | PERSON |
| SURFACE_COND | road condition | ROAD_SURFACE_COND |
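
Since `ACCIDENT_NO` appears in every table, the features above can in principle be pulled into one frame by joining on it. The sketch below uses made-up toy rows and code values. It also assumes that one accident can have several `ATMOSPH_COND` rows distinguished by `ATMOSPH_COND_SEQ` and keeps only the lowest sequence number, which is a guess about what that column means.

```python
import pandas as pd

# Toy stand-ins for three of the CSV files (all values are invented).
accident = pd.DataFrame({"ACCIDENT_NO": ["A1", "A2", "A3"], "SEVERITY": [2, 3, 1]})
atmos = pd.DataFrame({
    "ACCIDENT_NO": ["A1", "A1", "A2", "A3"],
    "ATMOSPH_COND": [1, 2, 1, 4],
    "ATMOSPH_COND_SEQ": [1, 2, 1, 1],
})
surface = pd.DataFrame({"ACCIDENT_NO": ["A1", "A2", "A3"], "SURFACE_COND": [1, 1, 2]})

# Assumption: keep only the first-listed atmospheric condition per accident,
# so the join below stays one row per ACCIDENT_NO.
atmos_first = (
    atmos.sort_values("ATMOSPH_COND_SEQ")
    .drop_duplicates("ACCIDENT_NO", keep="first")
    .drop(columns="ATMOSPH_COND_SEQ")
)

# validate= raises if either side unexpectedly has duplicate ACCIDENT_NO values.
merged = (
    accident
    .merge(atmos_first, on="ACCIDENT_NO", how="left", validate="one_to_one")
    .merge(surface, on="ACCIDENT_NO", how="left", validate="one_to_one")
)

# Quick look at how weather and road surface co-occur.
link = pd.crosstab(merged["ATMOSPH_COND"], merged["SURFACE_COND"])
print(link)
```

The crosstab at the end is one cheap way to judge whether `SURFACE_COND` is too closely tied to the weather data to be worth keeping as a separate feature.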


## ideas for the project

1. Comparing intersection crash data year on year to see which intersections improve
2. Predict number of incidents using
    1. features
        1.  weather data
        2.  SURFACE_COND     
            1.  too related to weather data?
        3.  location data
            1.  node, LGA, Region?
        4. night or day
           1. might be too related to time?
        5. Datetime
           1. Time of day (1, 2, 4, or 6 hour bins)
           2. Day of week (or just weekday vs weekend etc...)
           3. month of year (maybe creates overfitting, too granular?)

    2. labels (what to predict)
        1. count of Police attended
        2. Mean severity (or own severity metric)
           1. no_persons
           2. no_vehicles
           3. other NO_PERSON metrics
           4. number of people taken to hospitals
        3. straight count of incidents
    
    3.  to filter on
        1. accident_type: remove non-collisions?
        2. SEVERITY: remove non injury?
        3. POLICE_ATTEND (not known category)
        4. unique ACCIDENT_NO
        5. check surface condition link to weather data

    4.  datasets needed
        1.  ROAD_SURFACE_COND
        2.  PERSON
        3.  NODE
        4.  ACCIDENT
        5.  ATMOSPHERIC_COND
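
The datetime features and the "straight count of incidents" label from the list above could be sketched roughly as follows. The toy rows, the date and time string formats, the 6-hour bin width (`BIN_HOURS`), and the derived column names `HOUR_BIN`, `IS_WEEKEND` and `N_ACCIDENTS` are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for ACCIDENT.csv; the string formats are assumptions.
accidents = pd.DataFrame({
    "ACCIDENT_NO": ["A1", "A2", "A3", "A4"],
    "ACCIDENTDATE": ["02/01/2006", "02/01/2006", "07/01/2006", "08/01/2006"],
    "ACCIDENTTIME": ["08:15:00", "09:40:00", "23:05:00", "13:30:00"],
})

# Combine date and time into one timestamp per accident.
when = pd.to_datetime(
    accidents["ACCIDENTDATE"] + " " + accidents["ACCIDENTTIME"],
    format="%d/%m/%Y %H:%M:%S",
)

# Time-of-day bin: start hour of the bin the accident falls in (0, 6, 12, 18).
BIN_HOURS = 6
accidents["HOUR_BIN"] = (when.dt.hour // BIN_HOURS) * BIN_HOURS

# Weekday vs weekend: dayofweek is 0 for Monday through 6 for Sunday.
accidents["IS_WEEKEND"] = when.dt.dayofweek >= 5

# Label: count of unique accidents per feature combination.
counts = (
    accidents.groupby(["IS_WEEKEND", "HOUR_BIN"])["ACCIDENT_NO"]
    .nunique()
    .rename("N_ACCIDENTS")
    .reset_index()
)
print(counts)
```

Counting with `nunique()` rather than `count()` covers the "unique ACCIDENT_NO" filter, which matters once tables with several rows per accident (such as PERSON) are joined in.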


# EXPERIMENTATION

## playing with the data to see what each table has to offer and what each column means

Note: most of the code here is meaningless, just remnants from experimenting

In [None]:
# compare columns
# list(set(pd.read_csv(ROOT_DIR + DATA_RAW_DIR + 'ACCIDENT/ACCIDENT_EVENT.csv').columns) - set(pd.read_csv(ROOT_DIR + DATA_RAW_DIR + 'ACCIDENT/ACCIDENT.csv').columns))

# comparing ACCIDENT.csv from the general dataset to 2000_to_2005:
# count the ACCIDENT_NO values that only appear in the 2000_to_2005 file
# (forward slashes avoid invalid '\A' escapes and work on every OS)
old_ids = set(pd.read_csv(ROOT_DIR + DATA_RAW_DIR + '2000_to_2005_ACCIDENT/ACCIDENT.csv')['ACCIDENT_NO'])
new_ids = set(pd.read_csv(ROOT_DIR + DATA_RAW_DIR + 'ACCIDENT/ACCIDENT.csv')['ACCIDENT_NO'])
len(old_ids - new_ids)

In [None]:
# looking at any specific file
pd.read_csv(ROOT_DIR + DATA_RAW_DIR + "ACCIDENT/ACCIDENT_CHAINAGE.csv").iloc[0]
# pd.read_csv(ROOT_DIR + DATA_RAW_DIR + "ACCIDENT/ATMOSPHERIC_COND.csv").groupby('ATMOSPH_COND')['ATMOSPH_COND_SEQ'].unique()
# pd.read_csv(ROOT_DIR + DATA_RAW_DIR + "ACCIDENT/PERSON.csv").groupby('TAKEN_HOSPITAL').count()['ACCIDENT_NO']

In [None]:
# unique values inside atmospheric conditions
pd.read_csv(ROOT_DIR + DATA_RAW_DIR + "ACCIDENT/ATMOSPHERIC_COND.csv")['ATMOSPH_COND_SEQ'].unique()

In [None]:
# looking at the node file (read once and reuse)
nodes = pd.read_csv(ROOT_DIR + DATA_RAW_DIR + "ACCIDENT/NODE.csv")

# unique node ids; printed because only the last expression of a cell is displayed
print(len(nodes['NODE_ID'].unique()))
# len(pd.read_csv(ROOT_DIR + DATA_RAW_DIR + "ACCIDENT/ACCIDENT_LOCATION.csv")['NODE_ID'].unique())

nodes

In [None]:
# plotting node data using latitude and longitude

# reading node data
local = pd.read_csv(ROOT_DIR + DATA_RAW_DIR + "2000_to_2005_ACCIDENT/NODE.csv")
fig = px.scatter_mapbox(local, lat="Lat", lon="Long", hover_name="ACCIDENT_NO")

# update_geos only applies to geo figures (e.g. px.scatter_geo);
# a mapbox figure needs a base map style instead
fig.update_layout(mapbox_style="open-street-map")  # base map that needs no access token
fig.show()  # Show the plot