# Preprocessing
Preprocessing is the step that comes before machine learning: it prepares the data so that the model algorithm can make the most of it. We will follow these steps:
- **Get dummy variables:** categorical values need to be separated into boolean columns.
- **Rescaling:** normalization or standardization of the numerical values, so that differences in scale do not unbalance the algorithm.
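
To make the difference between the two rescaling options concrete, here is a minimal sketch on a made-up column. The `toy` dataframe and its values are assumptions for illustration only and are not taken from the accident data.

```python
import pandas as pd

# Toy numeric column; the values are invented for illustration.
toy = pd.DataFrame({"speed_limit": [20, 30, 30, 40, 70]})
col = toy["speed_limit"]

# Normalization (min-max): rescale the values into the [0, 1] range.
toy["speed_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score): zero mean and unit standard deviation.
# ddof=0 uses the population standard deviation.
toy["speed_zscore"] = (col - col.mean()) / col.std(ddof=0)

print(toy)
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values, while the z-score is usually a safer default when a column has outliers.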

### Get dummy variables

In [15]:
# importing necessary libraries
import pandas as pd
import numpy as np

In [29]:
# importing the data
df = pd.read_csv('../data/clean_data.csv')

In [30]:
# exploring which columns need to be converted to dummy variables
df.head()

Unnamed: 0.1,Unnamed: 0,lon,lat,severity,num_vehicles,num_casualties,date,doy,time,road_type,Speed_limit,ped_crossing,light_cond,weather,road_cond,hazards,urb_or_rur,police_presence,year
0,0,-0.169101,51.493429,3,2,1,2012-01-19,5,21,Single carriageway,30,Pedestrian phase at traffic signal junction,Darkness: Street lights present and lit,Fine without high winds,Dry,,1,Yes,2012
1,1,-0.200838,51.517931,3,2,1,2012-04-01,4,17,Single carriageway,30,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,,1,Yes,2012
2,2,-0.188636,51.487618,3,2,1,2012-10-01,3,10,One way street,30,non-junction pedestrian crossing,Daylight: Street light present,Fine without high winds,Dry,,1,Yes,2012
3,3,-0.200259,51.514325,3,1,1,2012-01-18,4,12,Single carriageway,30,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,,1,Yes,2012
4,4,-0.183773,51.497614,3,1,1,2012-01-17,3,20,Single carriageway,30,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,,1,Yes,2012


#### Police presence
This column, which tells us whether the police attended the accident, has only two values: Yes or No. We will change these values to 0 and 1:
- 0 will mean no.
- 1 will mean yes.

In [31]:
# we will use the numpy where function
df['police_presence'] = np.where(df['police_presence'] == 'Yes', 1, 0)
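
One thing to keep in mind with `np.where` used this way: every value that is not exactly `'Yes'` becomes 0, including missing values. The sketch below shows this on a made-up series (the `toy` data is an assumption, not part of the dataset) and compares it with an explicit mapping that leaves unexpected values as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; invented for illustration.
toy = pd.Series(["Yes", "No", None, "Yes"], name="police_presence")

# Same approach as the cell above: anything that is not exactly 'Yes'
# becomes 0, including the missing entry.
encoded = np.where(toy == "Yes", 1, 0)
print(encoded.tolist())

# Alternative: an explicit mapping leaves unexpected or missing values
# as NaN, so they can be counted before deciding how to handle them.
mapped = toy.map({"Yes": 1, "No": 0})
print(mapped.isna().sum())
```

If the column really contains only Yes and No, both approaches give the same result.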

#### Urban or rural
This column tells us whether the accident happened in an urban or a rural area.

In [41]:
# checking values
df['urb_or_rur'].value_counts()

1    968908
2    530565
3        35
Name: urb_or_rur, dtype: int64

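The column has two main values: 1 for urban and 2 for rural. A third value, 3, appears in only 35 rows and its meaning is not stated here. The cell below subtracts 1 from the column, so those 35 rows end up as 2 rather than 0 or 1. For the two main values, the conversion is: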
- 0 for urban.
- 1 for rural.

In [43]:
# we only need to subtract 1 from the column
df['urb_or_rur'] = df['urb_or_rur'] - 1
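
Because `value_counts()` shows a small number of rows with the value 3, plain subtraction does not produce a strictly binary column. A minimal sketch of one way to handle this, assuming the rare third code can simply be discarded (that choice is an assumption, and the `toy` data is invented for illustration):

```python
import pandas as pd

# Toy column mimicking the three codes seen in value_counts().
toy = pd.DataFrame({"urb_or_rur": [1, 2, 1, 3, 2]})

# Keep only the rows with the two expected codes
# (assumption: rows with the rare third code can be dropped).
toy = toy[toy["urb_or_rur"].isin([1, 2])].copy()

# Explicit mapping: 1 (urban) -> 0, 2 (rural) -> 1.
toy["urb_or_rur"] = toy["urb_or_rur"].map({1: 0, 2: 1})

print(toy["urb_or_rur"].value_counts())
```

An explicit mapping also documents the encoding in the code itself, which subtraction does not.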

#### Hazards
The 'hazards' column describes any hazard that was involved in the accident.

In [32]:
# checking values
df['hazards'].value_counts()

None                                       1472373
Other object in carriageway                  11707
Any animal (except a ridden horse)            7998
Pedestrian in carriageway (not injured)       3563
Involvement with previous accident            2278
Dislodged vehicle load in carriageway         1589
Name: hazards, dtype: int64

We have 6 different values, counting 'None'. These will be put in dummy variables, meaning that there will be one column per hazard, with a 1 in the rows with that hazard and a 0 in the rows without it. 'None' does not get its own column: it is represented by a 0 in all the hazard columns.

In [40]:
# we will use the pandas get_dummies function
hazards = pd.get_dummies(df['hazards'])

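# merging the dummy columns back into the dataframe, matching rows on the index
df1 = pd.merge(df, hazards, left_index=True, right_index=True)

# dropping the leftover index column, the original 'hazards' column and the 'None' dummy
# (note: the 'Unnamed: 0.1' column seen in df.head() is not dropped here)
df1.drop(['Unnamed: 0', 'hazards', 'None'], axis=1, inplace=True)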
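
To see what the dummy encoding produces, here is a small self-contained sketch on made-up rows. The `toy` dataframe is an assumption for illustration; `pd.concat` is used here as an alternative to `pd.merge` on the index, and `dtype=int` is passed because recent pandas versions return boolean columns from `get_dummies` by default.

```python
import pandas as pd

# Toy data with the same kind of categories as the 'hazards' column.
toy = pd.DataFrame({
    "severity": [3, 2, 3, 1],
    "hazards": [
        "None",
        "Other object in carriageway",
        "None",
        "Any animal (except a ridden horse)",
    ],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy["hazards"], dtype=int)

# 'None' is the baseline: dropping it means an all-zero row stands for "no hazard".
dummies = dummies.drop(columns="None")

# Attach the indicator columns and remove the original text column.
encoded = pd.concat([toy.drop(columns="hazards"), dummies], axis=1)

print(encoded)
```

Dropping one category also avoids having a set of dummy columns that always sum to 1, which can cause problems for some linear models.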
