# Preprocessing
Preprocessing is the step before machine learning: it prepares the data so that the model algorithm can work with it effectively. We will follow these steps:
- **Get dummy variables:** categorical values need to be separated into boolean columns.
- **Rescaling:** normalization of the numerical values so that features on larger scales do not dominate the algorithm.
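
As a minimal sketch of both steps on toy data (the `surface` and `speed` columns and their values are made up for illustration and are not taken from `clean_data.csv`):

```python
import pandas as pd

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "surface": ["Dry", "Wet", "Dry", "Snow"],
    "speed": [30, 60, 70, 20],
})

# Step 1: one indicator column per category.
dummies = pd.get_dummies(toy["surface"], prefix="surface").astype(int)

# Step 2: bring the numeric column into the (0, 1] range.
toy["speed_scaled"] = toy["speed"] / toy["speed"].max()

result = toy.join(dummies)
print(result)
```

Each row ends up with exactly one 1 among the `surface_*` columns, and the largest speed maps to 1.0.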

## Get dummy variables

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np

In [2]:
# importing the data
df = pd.read_csv('../data/clean_data.csv', index_col=0)



In [3]:
# exploring which are the columns that need to be dummies:
df.head()

Unnamed: 0,lon,lat,severity,num_vehicles,num_casualties,date,doy,time,road_type,Speed_limit,ped_crossing,light_cond,weather,road_cond,hazards,urb_or_rur,police_presence,year
0,-0.169101,51.493429,3,2,1,2012-01-19,5,21,Single carriageway,30,Pedestrian phase at traffic signal junction,Darkness: Street lights present and lit,Fine without high winds,Dry,,1,Yes,2012
1,-0.200838,51.517931,3,2,1,2012-04-01,4,17,Single carriageway,30,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,,1,Yes,2012
2,-0.188636,51.487618,3,2,1,2012-10-01,3,10,One way street,30,non-junction pedestrian crossing,Daylight: Street light present,Fine without high winds,Dry,,1,Yes,2012
3,-0.200259,51.514325,3,1,1,2012-01-18,4,12,Single carriageway,30,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,,1,Yes,2012
4,-0.183773,51.497614,3,1,1,2012-01-17,3,20,Single carriageway,30,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,,1,Yes,2012


#### Police presence
This column, which tells us whether the police attended the accident, has only two values: Yes or No. We will change these values to 0 and 1:
- 0 will mean no.
- 1 will mean yes.

In [4]:
# we will use the numpy where function
df['police_presence'] = np.where(df['police_presence'] == 'Yes', 1, 0)
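
One thing to keep in mind with `np.where` is that every value that is not exactly `'Yes'` becomes 0, including missing or misspelled entries. The sketch below, on a made-up series, contrasts it with `Series.map`, which leaves unmapped values as `NaN` so they remain visible. It is an optional check, not part of the pipeline above.

```python
import numpy as np
import pandas as pd

# Toy column; the last two values are deliberately unexpected.
presence = pd.Series(["Yes", "No", "yes ", np.nan])

# np.where: everything that is not exactly 'Yes' becomes 0.
with_where = np.where(presence == "Yes", 1, 0)

# map: values missing from the dictionary become NaN.
with_map = presence.map({"Yes": 1, "No": 0})

print(with_where.tolist())
print(with_map.isna().sum())
```

If the column really contains only Yes and No, as stated above, both approaches give the same result.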

#### Urban or rural
This column tells us whether the accident happened in an urban or a rural area.

In [5]:
# checking values
df['urb_or_rur'].value_counts()

1    968908
2    530565
Name: urb_or_rur, dtype: int64

We have two values: 1 for urban and 2 for rural. We will convert these into:
- 0 for urban.
- 1 for rural.

In [6]:
# we only need to subtract 1 from the column
df['urb_or_rur'] = df['urb_or_rur'] - 1

#### Hazards
The 'hazards' column tells us which risks or dangers were present at the accident.

In [7]:
# checking values
df['hazards'].value_counts()

None                                       1472338
Other object in carriageway                  11707
Any animal (except a ridden horse)            7998
Pedestrian in carriageway (not injured)       3563
Involvement with previous accident            2278
Dislodged vehicle load in carriageway         1589
Name: hazards, dtype: int64

We have 6 different values, counting 'None'. These will be converted into dummy variables: one column per hazard, with a 1 in the rows with that hazard and a 0 in the rows without it. 'None' gets no column of its own; it is represented by a 0 in all the other columns.

In [8]:
# we will use the pandas get_dummies function
hazards = pd.get_dummies(df['hazards'], prefix='hazard')

# merging them
df1 = pd.merge(df, hazards, left_index=True, right_index=True)

# dropping the useless columns
df1.drop(['hazards', 'hazard_None'], axis=1, inplace=True)
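
The same three steps (get dummies, merge, drop) are repeated for every categorical column in this section. A possible way to wrap them is sketched below; `add_dummies` and its `baseline` parameter are hypothetical names introduced here, and the sketch casts the indicators to `int`, whereas the cells in this notebook keep the dtype that `get_dummies` returns.

```python
import pandas as pd

def add_dummies(frame, column, prefix, baseline=None):
    """Replace `column` with indicator columns, optionally dropping a baseline."""
    dummies = pd.get_dummies(frame[column], prefix=prefix).astype(int)
    if baseline is not None:
        # The baseline category is then encoded as all zeros.
        dummies = dummies.drop(columns=f"{prefix}_{baseline}")
    # join aligns on the index, like merge with left_index/right_index.
    return frame.drop(columns=column).join(dummies)

# Toy data for illustration only.
toy = pd.DataFrame({
    "severity": [3, 2, 3],
    "hazards": ["None", "Other object in carriageway", "None"],
})
out = add_dummies(toy, "hazards", prefix="hazard", baseline="None")
print(out.columns.tolist())
```

Unlike the cells here, the helper returns a new frame instead of modifying one in place, which avoids keeping track of `df1`, `df2` and so on.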

#### Road condition
This column tells us the condition of the road surface. We have the following values:

In [9]:
df['road_cond'].value_counts()

Dry                          1032637
Wet/Damp                      422853
Frost/Ice                      31363
Snow                           10481
Flood (Over 3cm of water)       2139
Name: road_cond, dtype: int64

We'll get dummy variables from here, as we did for hazards. We'll also drop the 'Dry' value, because it means that none of the other conditions apply and the road is in its normal state.

In [10]:
# we will use the pandas get_dummies function
road_cond = pd.get_dummies(df['road_cond'], prefix='road')

# merging them
df2 = pd.merge(df1, road_cond, left_index=True, right_index=True)

# dropping the useless columns
df2.drop(['road_cond', 'road_Dry'], axis=1, inplace=True)
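
To see what dropping the baseline column does, the toy sketch below (values chosen for illustration) checks two things: the full set of dummies always sums to 1 per row, so one column is redundant, and after dropping `road_Dry` the dry rows are simply the rows where every remaining column is 0.

```python
import pandas as pd

# Toy series for illustration only.
road = pd.Series(["Dry", "Wet/Damp", "Dry", "Snow"], name="road_cond")

full = pd.get_dummies(road, prefix="road").astype(int)
# Every row has exactly one 1, so the columns always sum to 1.
row_sums = full.sum(axis=1)

reduced = full.drop(columns="road_Dry")
# 'Dry' rows are now encoded as all zeros.
dry_rows = reduced[road == "Dry"]

print(row_sums.tolist())
print(dry_rows.sum().sum())
```

`pd.get_dummies` also accepts `drop_first=True`, but that drops the first category in sorted order, which is not necessarily the one that represents normal conditions.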

#### Weather
This column tells us the weather conditions. We have the following values:

In [11]:
# checking values
df['weather'].value_counts()

Fine without high winds       1201497
Raining without high winds     177378
Other                           33389
Unknown                         26705
Raining with high winds         20774
Fine with high winds            18317
Snowing without high winds      11284
Fog or mist                      8173
Snowing with high winds          1956
Name: weather, dtype: int64

We'll get dummy variables from here, as we did for hazards. This time we won't drop any value: besides 'Fine without high winds' we also have 'Unknown' and 'Other', and although we don't know exactly what they mean, they could give us information.

In [12]:
# we will use the pandas get_dummies function
weather = pd.get_dummies(df['weather'], prefix='weather')

# merging them
df3 = pd.merge(df2, weather, left_index=True, right_index=True)

# dropping the useless columns
df3.drop(['weather'], axis=1, inplace=True)

#### Light conditions
This column tells us the light conditions. We have the following values:

In [13]:
df['light_cond'].value_counts()

Daylight: Street light present               1098803
Darkness: Street lights present and lit       295466
Darkeness: No street lighting                  82383
Darkness: Street lighting unknown              15935
Darkness: Street lights present but unlit       6886
Name: light_cond, dtype: int64

We'll get dummy variables from here, as we did for hazards. We'll also drop the 'Daylight: Street light present' value, because it means that none of the darkness conditions apply and the conditions are normal.

In [14]:
# we will use the pandas get_dummies function
light = pd.get_dummies(df['light_cond'])

# merging them
df4 = pd.merge(df3, light, left_index=True, right_index=True)

# dropping the useless columns
df4.drop(['light_cond', 'Daylight: Street light present'], axis=1, inplace=True)

#### Pedestrian crossing
We have the following values:

In [15]:
df['ped_crossing'].value_counts()

No physical crossing within 50 meters          1248826
Pedestrian phase at traffic signal junction      99898
non-junction pedestrian crossing                 78943
Zebra crossing                                   39945
Central refuge                                   27575
Footbridge or subway                              4286
Name: ped_crossing, dtype: int64

We'll get dummy variables from here, as we did for hazards. We'll also drop the 'No physical crossing within 50 meters' value, because it means that none of the other crossing types apply and the conditions are normal.

In [16]:
# we will use the pandas get_dummies function
peds = pd.get_dummies(df['ped_crossing'])

# merging them
df5 = pd.merge(df4, peds, left_index=True, right_index=True)

# dropping the useless columns
df5.drop(['ped_crossing', 'No physical crossing within 50 meters'], axis=1, inplace=True)

#### Road types
We have the following values:

In [17]:
df['road_type'].value_counts()

Single carriageway    1123476
Dual carriageway       221243
Roundabout             100006
One way street          30825
Slip road               15632
Unknown                  8291
Name: road_type, dtype: int64

We'll get dummy variables from here, as we did for hazards.

In [19]:
# we will use the pandas get_dummies function
rtypes = pd.get_dummies(df['road_type'], prefix='rtype')

# merging them
df6 = pd.merge(df5, rtypes, left_index=True, right_index=True)

# dropping the useless columns
df6.drop(['road_type'], axis=1, inplace=True)

#### Day of the week
We have the following values:

In [20]:
df['doy'].value_counts()

6    246366
5    225712
4    225619
3    223407
2    213064
7    200803
1    164502
Name: doy, dtype: int64

We will divide the days of the week into weekend (Fri, Sat and Sun) and the rest of the week (Mon, Tue, Wed and Thu).

In [21]:
# we will use np.where for this purpose
df6['weekend'] = np.where((df.doy == 6) | (df.doy == 7) | (df.doy == 1), 1, 0)
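
An equivalent way to write the same flag is `Series.isin`, sketched below on toy day codes. It assumes, as the cell above implies, that the codes 6, 7 and 1 stand for Friday, Saturday and Sunday.

```python
import pandas as pd

# Toy day codes; 6, 7 and 1 are assumed to be Fri, Sat and Sun.
doy = pd.Series([1, 2, 5, 6, 7])

weekend_days = [6, 7, 1]
# isin returns booleans; astype(int) turns them into 1/0.
weekend = doy.isin(weekend_days).astype(int)

print(weekend.tolist())
```

Keeping the day codes in a list makes it easy to change the definition of weekend later.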

In [22]:
# checking the new column for missing values
df6.weekend.isna()

0          False
1          False
2          False
3          False
4          False
           ...  
1504145    False
1504146    False
1504147    False
1504148    False
1504149    False
Name: weekend, Length: 1499473, dtype: bool

#### Time
We have 24 values indicating the hour of the accident; we want to convert them into day and night.
We'll create a column named 'night' that contains 0 if the hour is between 7 and 22 (both included) and 1 otherwise, which we consider the night.

In [23]:
df6['night'] = np.where((df['time'] >= 7) & (df['time'] <= 22), 0, 1)
df6.drop(['time'], axis=1, inplace=True)
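
The boundary hours are the easiest part to get wrong, so here is a small sketch on toy hours using `Series.between`, which includes both ends by default and therefore matches the `>= 7` and `<= 22` condition above.

```python
import pandas as pd

# Toy hours covering both boundaries.
hours = pd.Series([0, 6, 7, 15, 22, 23])

# between(7, 22) is True for day hours; negate it to flag the night.
night = (~hours.between(7, 22)).astype(int)

print(night.tolist())
```

Hours 7 and 22 count as day, while 6 and 23 count as night.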

#### Now that we've finished converting dummy variables, we'll drop columns that don't give relevant information for our analysis.

In [24]:
df6.drop(['lat','lon','date', 'year'], axis=1, inplace=True)

In [25]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1499473 entries, 0 to 1504149
Data columns (total 42 columns):
severity                                          1499473 non-null int64
num_vehicles                                      1499473 non-null int64
num_casualties                                    1499473 non-null int64
doy                                               1499473 non-null int64
Speed_limit                                       1499473 non-null int64
urb_or_rur                                        1499473 non-null int64
police_presence                                   1499473 non-null int64
hazard_Any animal (except a ridden horse)         1499473 non-null uint8
hazard_Dislodged vehicle load in carriageway      1499473 non-null uint8
hazard_Involvement with previous accident         1499473 non-null uint8
hazard_Other object in carriageway                1499473 non-null uint8
hazard_Pedestrian in carriageway (not injured)    1499473 non-null uint8
road_Flood 

## Rescaling 
Numerical values are rescaled so that features on larger scales do not dominate the algorithm.

In [26]:
# we've tried some rescaling methods from sklearn to get the data between 0 and 1, but none of them worked correctly
# we'll divide the speed limit by the max we have

df6['Speed_limit'] = df6['Speed_limit']/df6['Speed_limit'].max()
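
Dividing by the maximum is not the same as min-max scaling: the largest value maps to 1, but the smallest value maps to min/max rather than 0. The sketch below compares both on made-up speed limits; only the maximum of 70 is consistent with the values shown in this notebook (30 becomes 0.428571).

```python
import pandas as pd

# Toy speed limits for illustration only.
speed = pd.Series([20, 30, 60, 70], name="Speed_limit")

# Approach used in this notebook: divide by the maximum.
by_max = speed / speed.max()

# Min-max scaling: the smallest value maps to 0, the largest to 1.
min_max = (speed - speed.min()) / (speed.max() - speed.min())

print(by_max.round(3).tolist())
print(min_max.round(3).tolist())
```

Either choice keeps the values within 0 and 1; dividing by the maximum also preserves the ratios between speed limits.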

In [27]:
# checking everything looks okay
df6.head()

Unnamed: 0,severity,num_vehicles,num_casualties,doy,Speed_limit,urb_or_rur,police_presence,hazard_Any animal (except a ridden horse),hazard_Dislodged vehicle load in carriageway,hazard_Involvement with previous accident,...,Zebra crossing,non-junction pedestrian crossing,rtype_Dual carriageway,rtype_One way street,rtype_Roundabout,rtype_Single carriageway,rtype_Slip road,rtype_Unknown,weekend,night
0,3,2,1,5,0.428571,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,3,2,1,4,0.428571,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,3,2,1,3,0.428571,0,1,0,0,0,...,0,1,0,1,0,0,0,0,0,0
3,3,1,1,4,0.428571,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,3,1,1,3,0.428571,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [33]:
df6.to_csv('../data/preprocessed_data.csv')