### Data exploration

## Weather dataset preprocessing

Because we selected every available attribute when gathering the data, we now need to drop the columns that are completely empty or not useful for our purpose. Other preprocessing steps follow.
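As a minimal sketch of what dropping completely empty columns can look like in pandas, the snippet below uses a small toy frame (the values and the choice of columns are made up for illustration, not taken from `weather.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (illustrative values only).
toy = pd.DataFrame({
    'DATE': ['2018-06-01', '2018-06-02', '2018-06-03'],
    'PRCP': [1.8, np.nan, 14.0],          # partially filled column
    'DAPR': [np.nan, np.nan, np.nan],     # completely empty column
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()

# how='all' removes only the columns that contain no values at all,
# so partially filled columns such as PRCP are kept.
toy_clean = toy.dropna(axis=1, how='all')

print(empty_cols)
print(toy_clean.columns.tolist())
```

This only handles the "completely empty" case; deciding which of the remaining columns are useful still has to be done by hand.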

In [112]:
import os
import re
import geopy as gp
import numpy as np
import pandas as pd
import altair as alt
import geopandas as gpd
from uszipcode import SearchEngine
from Modules import collision_preprocessing as cp

In [113]:
dir = './Data'
temp_pre = './Data/tmp_pre'
colission_exists = False

In [114]:
weather = pd.read_csv(f'{dir}/weather.csv')
weather.shape

(90591, 66)

In [115]:
weather.head()

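| | STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | AWND_ATTRIBUTES | DAPR | DAPR_ATTRIBUTES | ... | WT05 | WT05_ATTRIBUTES | WT06 | WT06_ATTRIBUTES | WT08 | WT08_ATTRIBUTES | WT09 | WT09_ATTRIBUTES | WT11 | WT11_ATTRIBUTES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USC00280907 | BOONTON 1 SE, NJ US | 40.89174 | -74.39635 | 85.3 | 2018-01-01 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | USC00280907 | BOONTON 1 SE, NJ US | 40.89174 | -74.39635 | 85.3 | 2018-01-02 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | USC00280907 | BOONTON 1 SE, NJ US | 40.89174 | -74.39635 | 85.3 | 2018-01-03 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | USC00280907 | BOONTON 1 SE, NJ US | 40.89174 | -74.39635 | 85.3 | 2018-01-04 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | USC00280907 | BOONTON 1 SE, NJ US | 40.89174 | -74.39635 | 85.3 | 2018-01-05 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |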


### Data selection

In [116]:
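# Keep June-September of 2018 and 2020 (ISO date strings compare correctly as text)
in_2018 = (weather['DATE'] >= '2018-06-01') & (weather['DATE'] <= '2018-09-30')
in_2020 = (weather['DATE'] >= '2020-06-01') & (weather['DATE'] <= '2020-09-30')
weather = weather[in_2018 | in_2020].copy()  # copy avoids SettingWithCopyWarning later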

weather.to_csv(f'{dir}/weather_2018-2020.csv', index=False)
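The selection above compares *DATE* as text, which works because the dates are in ISO format. As an alternative sketch, assuming one prefers real datetime values, the same two summer windows can be selected with `pd.to_datetime` and `Series.between` (the toy rows below are made up for illustration):

```python
import pandas as pd

# Toy rows around the boundaries of the two windows (illustrative values only).
toy = pd.DataFrame({
    'DATE': ['2018-05-31', '2018-06-01', '2018-09-30',
             '2019-07-15', '2020-06-01', '2020-10-01'],
    'PRCP': [0.0, 1.8, 0.0, 3.2, 5.1, 0.0],
})

# Parse the strings once so that comparisons are made on datetimes.
dates = pd.to_datetime(toy['DATE'])

# between() is inclusive on both ends by default.
summer_2018 = dates.between(pd.Timestamp('2018-06-01'), pd.Timestamp('2018-09-30'))
summer_2020 = dates.between(pd.Timestamp('2020-06-01'), pd.Timestamp('2020-09-30'))

# Keep rows that fall in either window; copy so later edits act on a new frame.
toy_summer = toy[summer_2018 | summer_2020].copy()

print(toy_summer['DATE'].tolist())
```

Parsing the dates also guards against rows whose date strings are not zero-padded, where a plain text comparison could silently misorder them.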

In [117]:
weather.shape

(20536, 66)

A quick look shows that many attributes are very sparse and many others are not useful here. The simplest way to handle them is to list the attributes we want to keep and drop all the others.

First, we join *LATITUDE* and *LONGITUDE* into a single *LOCATION* attribute, following the format used in the previous dataset. We drop *ELEVATION* because it cannot be related to the previous dataset and is not useful for our purpose.

In [118]:
weather['LOCATION'] = '(' + weather['LATITUDE'].astype(str) + ', ' + weather['LONGITUDE'].astype(str) + ')'
weather.drop(columns=['LATITUDE', 'LONGITUDE', 'ELEVATION'], inplace=True)

Since we have *LOCATION*, we no longer need the *STATION* code or *NAME*.

In [119]:
weather.drop(columns=['STATION', 'NAME'], inplace=True)

Now, for the remaining columns, let's follow the proposed strategy. We keep *LOCATION* so that we can look up the weather conditions in the area where a collision occurred. For the weather attributes, following the *Documentation* of the weather dataset, provided by *NOAA* and available in the Documentation directory, we are (for now) interested in:

- *PRCP* : Precipitation (mm)
- *SNOW* : Snowfall (mm)
- *SNWD* : Snow depth (mm)
- *TMAX* : Maximum temperature (Celsius)
- *TMIN* : Minimum temperature (Celsius)
- *TOBS* : Temperature at the time of observation (Celsius)
- *AWND* : Average daily wind speed (meters per second)
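Regarding the *LOCATION* attribute mentioned above, here is a rough sketch of how it could later be used to pair a collision with its nearest weather station. This is only an assumption about a possible approach, not the method used in this notebook: the helper names `parse_location` and `haversine_km`, the second station and the collision coordinates are all hypothetical.

```python
import numpy as np

def parse_location(location):
    """Turn a '(lat, lon)' string into a (lat, lon) tuple of floats."""
    lat, lon = location.strip('()').split(',')
    return float(lat), float(lon)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (mean Earth radius of 6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# First station comes from the data shown above; the second one is hypothetical.
stations = ['(40.89174, -74.39635)', '(40.7789, -73.9692)']

# Hypothetical collision coordinates.
collision_lat, collision_lon = 40.7580, -73.9855

# Distance from the collision to every station, then pick the closest one.
coords = np.array([parse_location(s) for s in stations])
dist = haversine_km(collision_lat, collision_lon, coords[:, 0], coords[:, 1])
nearest = stations[int(np.argmin(dist))]

print(nearest)
```

A brute-force distance computation like this is fine for a handful of stations; with many collisions a spatial index or a geopandas nearest join may be a better fit.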

In [120]:
weather['WT09_ATTRIBUTES'].unique()

array([nan], dtype=object)

In [121]:
select_col = ['DATE',
              'PRCP',
              'SNOW',
              'SNWD',
              'TMAX',
              'TMIN',
              'TOBS',
              'AWND',
              'WT01',
              'WT02',
              'WT03',
              'WT04',
              'WT05',
              'WT06',
              'WT08',
              'WT09',
              'WT11',
              'LOCATION']

weather = weather[select_col]

# move the 'LOCATION' column to the first position
cols = list(weather.columns)
cols = [cols[-1]] + cols[:-1]
weather = weather[cols]

In [122]:
weather.columns

Index(['LOCATION', 'DATE', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS',
       'AWND', 'WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09',
       'WT11'],
      dtype='object')

In [None]:
wt_columns = ['WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09', 'WT11']
weather[weather[wt_columns].isnull().all(axis=1)].shape

(19894, 18)

In [None]:
# rows where every WT** value is null may simply be fair-weather days
# the WT** columns are binary flags: null means the event did not occur
# if any WT** value is not null, the day was not a fair-weather day

In [None]:
weather['WT01'].unique(), weather['WT02'].unique(), weather['WT03'].unique(), weather['WT04'].unique(), weather['WT05'].unique(), weather['WT06'].unique(), weather['WT08'].unique(), weather['WT09'].unique(), weather['WT11'].unique()

(array([nan,  1.]),
 array([nan,  1.]),
 array([nan,  1.]),
 array([nan]),
 array([nan,  1.]),
 array([nan]),
 array([nan,  1.]),
 array([nan]),
 array([nan,  1.]))

In [None]:
wt_columns = ['WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09', 'WT11']

# 1 if at least one WT** flag is set for the day, 0 if all of them are null
weather['ADVERSE_CONDITION'] = weather[wt_columns].notnull().any(axis=1).astype(int)

weather['ADVERSE_CONDITION'].value_counts()

ADVERSE_CONDITION
0    19894
1      642
Name: count, dtype: int64
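To make the flag logic easier to check by hand, here is a small sketch on a toy frame. The rows are made up, only three of the WT** columns are used, and the extra `N_EVENTS` column is a hypothetical addition that is not part of this notebook:

```python
import numpy as np
import pandas as pd

# Subset of the weather-type columns, enough to illustrate the idea.
wt_cols = ['WT01', 'WT03', 'WT08']

# Toy rows: a day with no events, a day with two, a day with one.
toy = pd.DataFrame({
    'DATE': ['2018-06-01', '2018-06-02', '2018-06-03'],
    'WT01': [np.nan, 1.0, np.nan],
    'WT03': [np.nan, 1.0, np.nan],
    'WT08': [np.nan, np.nan, 1.0],
})

# 1 when at least one weather-type flag is present, 0 when all are null.
toy['ADVERSE_CONDITION'] = toy[wt_cols].notnull().any(axis=1).astype(int)

# Hypothetical extra: how many distinct events were reported that day.
toy['N_EVENTS'] = toy[wt_cols].notnull().sum(axis=1)

print(toy[['DATE', 'ADVERSE_CONDITION', 'N_EVENTS']])
```

Collapsing the flags into a single binary column discards which event occurred; a count such as `N_EVENTS` is one possible way to retain a little more detail if it turns out to matter.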

In [None]:
weather.drop(columns=wt_columns, inplace=True)
weather.head()

| | LOCATION | DATE | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS | AWND | ADVERSE_CONDITION |
|---|---|---|---|---|---|---|---|---|---|---|
| 151 | (40.89174, -74.39635) | 2018-06-01 | 1.8 | 0.0 | 0.0 | 20.6 | 16.7 | 18.9 | NaN | 0 |
| 152 | (40.89174, -74.39635) | 2018-06-02 | 5.3 | 0.0 | 0.0 | 28.9 | 18.9 | 21.7 | NaN | 0 |
| 153 | (40.89174, -74.39635) | 2018-06-03 | 14.0 | 0.0 | 0.0 | 29.4 | 14.4 | 15.0 | NaN | 0 |
| 154 | (40.89174, -74.39635) | 2018-06-04 | 16.8 | 0.0 | 0.0 | 18.9 | 10.6 | 11.1 | NaN | 0 |
| 155 | (40.89174, -74.39635) | 2018-06-05 | 0.0 | 0.0 | 0.0 | 23.3 | 11.1 | 12.2 | NaN | 0 |


### Missing values

In [130]:
weather.isnull().sum()

LOCATION                 0
DATE                     0
PRCP                   297
SNOW                 11314
SNWD                 16813
TMAX                 17221
TMIN                 17233
TOBS                 19244
AWND                 18608
ADVERSE_CONDITION        0
dtype: int64
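Raw counts are easier to judge as a share of the rows. The sketch below recomputes the counts above as percentages of the 20536 rows kept after the date selection; the 50% cut-off is only an illustrative assumption, not a threshold chosen in this notebook:

```python
import pandas as pd

# Number of rows after the date selection (from weather.shape above).
n_rows = 20536

# Missing-value counts copied from the output of weather.isnull().sum().
missing = pd.Series({
    'LOCATION': 0, 'DATE': 0, 'PRCP': 297, 'SNOW': 11314, 'SNWD': 16813,
    'TMAX': 17221, 'TMIN': 17233, 'TOBS': 19244, 'AWND': 18608,
    'ADVERSE_CONDITION': 0,
})

# Percentage of missing values per column, largest first.
missing_pct = (missing / n_rows * 100).round(2).sort_values(ascending=False)

# Columns missing in more than half of the rows (illustrative threshold).
mostly_missing = missing_pct[missing_pct > 50].index.tolist()

print(missing_pct)
print(mostly_missing)
```

On the live frame the same percentages can be obtained directly with `weather.isnull().mean() * 100`.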