# Predicting Severity of Car Accident

# Table of Contents  
[1. Introduction: Problem Understanding](#problem)  
[2. Data](#data)  
> [2.1. Data Understanding](#datUnd)  
  [2.2. Data Wrangling](#datWrang)

[3. Methodology](#method)  
> [3.1. Exploratory Data Analysis](#exp)  
  [3.2. Feature Selection and Feature Engineering](#feature)  
  [3.3. Modeling](#model)  
  [3.4. Model Evaluation](#eval)  
  
[4. Results and Discussion](#rd)  
[5. Conclusion](#conc)  

# 1. Introduction: Problem Understanding <a class="anchor" id="problem"></a>

Would it not be great if you could tell in advance how severe an accident will be if it occurs? In this project, we use car-accident data for the <font color="red"> city of Seattle in Washington State, US, </font> from 2004 to 2013.

At the end of this project, you should be able to tell how severe a collision is based on attributes of the collision such as the street on which it occurs, the weather and light conditions, and whether the driver was under the influence of alcohol or drugs.

# 2. Data <a class="anchor" id="data"></a>

## 2.1. Data Understanding <a class="anchor" id="datUnd"></a>

The data contains information about the car accidents that occurred in the city of Seattle in the US. It contains 221144 records and 40 attributes (features), which include the severity, coordinates, and location of the accident, the weather and light conditions, and whether the accident involved speeding or driving under the influence of alcohol. The data is publicly accessible and can be downloaded from the [city government website](https://data.seattle.gov/).

The features of the data are summarized in the following table:  

| Feature/Attribute | Description |
| :- | :- |
| SEVERITYCODE | A code that corresponds to the severity of the collision. **This is the target label** |
| X | The x-coordinate of the exact location of the collision |
| Y | The y-coordinate of the exact location of the collision |
| OBJECTID | Unique identifier |
| INCKEY | A unique key for the incident |
| COLDETKEY | Secondary key for the incident |
| STATUS | NA |
| ADDRTYPE | Collision address type which can be _alley, block_ or _intersection_ |
| INTKEY | Key that corresponds to the intersection associated with a collision |
| LOCATION | Description of the general location of the collision |
| EXCEPTRSNCODE | NA |
| EXCEPTRSNDESC | NA |
| SEVERITYDESC | A detailed description of the severity of the collision |
| COLLISIONTYPE | Collision type |
| PERSONCOUNT | The total number of people involved in the collision |
| FATALITIES | The number of fatalities in the collision. This is entered by the state |
| INCDATE | The date of the incident |
| INCDTTM | The date and time of the incident |
| JUNCTIONTYPE | Category of junction at which collision took place |
| SDOT_COLCODE | A code given to the collision by SDOT |
| SDOT_COLDESC | A description of the collision corresponding to the collision code |
| INATTENTIONIND | Whether or not collision was due to inattention |
| UNDERINFL | Whether or not a driver involved was under the influence of drugs or alcohol |
| WEATHER | A description of the weather conditions during the time of the collision |
| ROADCOND | The condition of the road during the collision |
| LIGHTCOND | The light conditions during the collision |
| PEDROWNOTGRNT | Whether or not the pedestrian right of way was not granted |
| SDOTCOLNUM | A number given to the collision by SDOT |
| SPEEDING | Whether or not speeding was a factor in the collision |
| ST_COLCODE | A code provided by the state that describes the collision |
| ST_COLDESC | A description that corresponds to the state’s coding designation |
| SEGLANEKEY | A key for the lane segment in which the collision occurred |
| CROSSWALKKEY | A key for the crosswalk at which the collision occurred |
| HITPARKEDCAR | Whether or not the collision involved hitting a parked car |

**Important Notes:**  
- The target label `SEVERITYCODE` has multiple values. Therefore, we have a <font color="red"> multiclass classification </font> problem.
- The target label takes the following values:

| Value | Description |
| :- | :- |
| 0 | Unknown |
| 1 | Property damage |
| 2 | Injury |
| 2b | Serious Injury |
| 3 | Fatality |

- The values of the target label are not equally represented in the dataset. Therefore, we have an <font color="red"> imbalanced classification </font> problem.  
- Not all attributes are relevant for modeling. Various keys that are probably used for archiving and administrative purposes, such as `OBJECTID` (a key that uniquely identifies the collision) and `REPORTNO` (a key that uniquely identifies the report of the collision), are irrelevant and should be removed.
- The remaining attributes, though relevant, correlate to different degrees with the severity of the accident. We therefore still perform feature selection to keep the most informative features and to reduce the risk of overfitting.

In [15]:
import pandas as pd
import numpy as np

In [16]:
# Read data
df = pd.read_csv("/home/ahmed/Courses/Data_Science/Capstone/Collisions.csv")

Number of features and observations:

In [17]:
print('The number of observations = %d and number of features = %d' %(df.shape))

The number of observations = 221144 and number of features = 40


From the table above, it can be observed that not all features are relevant for the machine learning model. These irrelevant features include several unique keys and codes (OBJECTID, SDOT_COLCODE, etc.), the report number (REPORTNO), and the description of the severity of the collision (SEVERITYDESC).  

Please note that not all keys should be removed, since some keys are <font color="blue"> informative </font>, such as the intersection key (INTKEY) and the lane key (SEGLANEKEY), which identify the intersection and the lane at which the collision occurs, respectively.

In [18]:
df.drop(labels=['OBJECTID', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'INCDATE', 'INCDTTM', 'SDOT_COLCODE', 'SDOT_COLDESC',
                'SDOTCOLNUM', 'ST_COLDESC', 'HITPARKEDCAR', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'SEVERITYDESC',
                'ST_COLCODE'], axis=1, inplace=True)
# See how many attributes remain
print('The number of possibly relevant feature = %d' %(df.shape[1]))

The number of possibly relevant feature = 25


In [19]:
# Show the feature names
df.columns.values

array(['X', 'Y', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION',
       'SEVERITYCODE', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT',
       'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES',
       'FATALITIES', 'JUNCTIONTYPE', 'INATTENTIONIND', 'UNDERINFL',
       'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SPEEDING',
       'SEGLANEKEY', 'CROSSWALKKEY'], dtype=object)

**How many classes are there, and how are they distributed?**

In [20]:
df.SEVERITYCODE.value_counts()

1     137414
2      58665
0      21619
2b      3096
3        349
Name: SEVERITYCODE, dtype: int64

<font color="blue"> As can be observed, the classes are not equally represented in the data set, and thus we have an </font> <font color="red"> imbalanced classification </font> problem. We will deal with this later.
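
To put numbers on the imbalance, the class counts printed above can be turned into shares and per-class weights. The sketch below is only an illustration: the weighting rule `n_samples / (n_classes * class_count)` is one common "balanced" heuristic, not necessarily the remedy used later in this project, and the names `shares`, `weights` and `summary` are made up here.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"1": 137414, "2": 58665, "0": 21619, "2b": 3096, "3": 349})

# Share of each class among the labeled records
shares = counts / counts.sum()

# "Balanced" heuristic (an assumption, not the author's stated method):
# rare classes get proportionally larger weights
weights = counts.sum() / (len(counts) * counts)

summary = pd.DataFrame({"count": counts, "share": shares.round(4), "weight": weights.round(2)})
print(summary)
```

The most frequent class (1) has nearly 400 times as many records as the rarest class (3), so a model that ignores the imbalance can score well on accuracy while almost never predicting a fatality.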

## 2.2. Data Wrangling <a class="anchor" id="datWrang"></a>

### Handling Missing Values


Missing values are handled as follows:
- Dropping a feature (dropping the entire column)  
This is done when the missing values are too many (above 50%). In this case, imputing values will likely generate inaccurate data that does not represent the real distribution of the corresponding feature.
- Dropping a record (dropping the entire row)  
This is done for missing values of the target label.
- Replacing missing values with a roulette wheel selection  
For categorical features, the common practice is to replace the missing values with the most frequent one. However, I use an approach that is more likely to respect the underlying, unknown distribution from which the data was drawn. This is done by a **roulette wheel selection**, in which the probability that a non-missing value replaces a missing value is proportional to the frequency of that non-missing value.
- Replacing missing values with the mean  
For numerical features, a common option is to replace the missing values with the mean of the column. As discussed for X and Y below, this is not always meaningful.
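
The difference between most-frequent imputation and roulette wheel selection can be seen on synthetic data. The following is a minimal sketch, assuming a made-up categorical column with three values `A`, `B` and `C` and 20% of its entries removed at random; none of these numbers come from the collision data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical column: 60% "A", 30% "B", 10% "C", then about 20% of the entries blanked out
toy = pd.Series(rng.choice(["A", "B", "C"], size=10_000, p=[0.6, 0.3, 0.1]), dtype="object")
toy[rng.random(toy.size) < 0.2] = np.nan

# Distribution of the values that are still present
observed = toy.value_counts(normalize=True)

# Option 1: fill every gap with the most frequent value
mode_filled = toy.fillna(observed.idxmax())

# Option 2: roulette wheel, each gap is filled with a value drawn
# with probability equal to its observed frequency
gaps = toy.isna()
draws = rng.choice(observed.index.to_numpy(), size=int(gaps.sum()), p=observed.to_numpy())
wheel_filled = toy.copy()
wheel_filled.loc[gaps] = draws

print("observed    :", observed.round(3).to_dict())
print("mode fill   :", mode_filled.value_counts(normalize=True).round(3).to_dict())
print("wheel fill  :", wheel_filled.value_counts(normalize=True).round(3).to_dict())
```

Filling with the mode inflates the share of the most frequent value, while the roulette wheel keeps the shares close to the observed ones.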

**Handling missing values of the target label**

In [21]:
# check if the target label has missing values
df.SEVERITYCODE.isnull().sum()

1

In [22]:
# remove the missing value from the target label
df.dropna(subset=['SEVERITYCODE'], axis=0, inplace=True)
# reset the indexes
df.reset_index(drop=True, inplace=True)
# check again
df.SEVERITYCODE.isnull().sum()

0

In [23]:
def displayna(dataFrame, threshold=50, showall=False):
    """
    Print the columns that have missing values above a threshold (in percent). If showall is True, all columns
    are displayed regardless of the threshold.
    """
    missing_data = dataFrame.isnull()
    for column in missing_data.columns.values.tolist():
        # Counts of False (present) and True (missing) entries for this column
        series = missing_data[column].value_counts()
        indexes = series.index.values.tolist()
        printed = False
        for idx in indexes:
            # Rounded percentage of rows that fall into this group
            percentage = np.round(100*series.loc[idx]/dataFrame.shape[0])
            if showall or (idx and percentage > threshold):
                if not printed: print("Column Name: ", column)
                print("Missing Value? ", idx, " Percentage: ", percentage)
                printed = True
        if printed: print("")
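
The same percentages can also be computed without explicit loops. This is a minimal sketch on a toy frame (the column names `A`, `B`, `C` and the variable `missing_pct` are made up for illustration); it relies on the fact that the mean of a boolean column is the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: A is 75% missing, B is 25% missing, C is complete
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],
    "B": [1.0, 2.0, 3.0, np.nan],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# isna() gives a boolean frame; its column means are the missing fractions
missing_pct = toy.isna().mean().mul(100).round()

# Columns whose missing percentage exceeds the 50% threshold
too_sparse = missing_pct[missing_pct > 50].index.tolist()
print(missing_pct)
print(too_sparse)
```

The resulting list can be passed directly to `DataFrame.drop(columns=...)`.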

Handling columns with too many missing values (above 50%)

In [24]:
displayna(df)

Column Name:  INTKEY
Missing Value?  True  Percentage:  68.0

Column Name:  INATTENTIONIND
Missing Value?  True  Percentage:  86.0

Column Name:  PEDROWNOTGRNT
Missing Value?  True  Percentage:  98.0

Column Name:  SPEEDING
Missing Value?  True  Percentage:  96.0



In [25]:
# drop those columns
df.drop(labels=['INTKEY', 'INATTENTIONIND', 'PEDROWNOTGRNT', 'SPEEDING'], axis=1, inplace=True)
# check
print("The number of remaining features = ", df.shape[1])

The number of remaining features =  21


Check if there is any column with some missing values

In [26]:
displayna(df, showall=True)

Column Name:  X
Missing Value?  False  Percentage:  97.0
Missing Value?  True  Percentage:  3.0

Column Name:  Y
Missing Value?  False  Percentage:  97.0
Missing Value?  True  Percentage:  3.0

Column Name:  STATUS
Missing Value?  False  Percentage:  100.0

Column Name:  ADDRTYPE
Missing Value?  False  Percentage:  98.0
Missing Value?  True  Percentage:  2.0

Column Name:  LOCATION
Missing Value?  False  Percentage:  98.0
Missing Value?  True  Percentage:  2.0

Column Name:  SEVERITYCODE
Missing Value?  False  Percentage:  100.0

Column Name:  COLLISIONTYPE
Missing Value?  False  Percentage:  88.0
Missing Value?  True  Percentage:  12.0

Column Name:  PERSONCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  PEDCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  PEDCYLCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  VEHCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  INJURIES
Missing Value?  False  Percentage:  100.0

Column Name:  SERI

**Consider X and Y**

**Important note:**  

Based on the result above, we can conclude that:  
- Replacing a missing value with the mean does not make sense <font color="red"> since the mean in this case can correspond to a location in which it is impossible to have a car accident</font>, such as a lake or a stream. 
- There are no dominant values for X and Y, and the likelihood of having an accident at a particular (X,Y)-location is very small. Therefore, it makes sense to replace the missing values for X and Y with values selected at random from the observed X and Y values. 

Define a little function for this purpose

In [27]:
def random_replacena(dataFrame, col):
    """
    Replace missing values with random values selected from the same series.
    """
    # Find the non-missing values
    series = dataFrame[pd.notna(dataFrame[col])][col]
    # Replace missing values at random
    dataFrame[col] = dataFrame[col].apply(lambda x: np.random.choice(series) if (pd.isna(x)) else x)
    # Check if there is no missing value
    if dataFrame[col].isna().sum() != 0:
        print('Error!')
    else:
        print('Successful!')

In [28]:
random_replacena(df, 'X')
random_replacena(df, 'Y')

Successful!
Successful!
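
Note that `random_replacena` draws X and Y independently, so an imputed (X, Y) pair may combine coordinates from two different collisions. One possible alternative, sketched here, is to copy both coordinates from a single randomly chosen complete record. The function name `random_replacena_pairs` and its `seed` parameter are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def random_replacena_pairs(data_frame, cols=("X", "Y"), seed=None):
    """
    Return a copy in which every row with a missing value in `cols`
    receives the `cols` values of one randomly chosen complete row.
    """
    cols = list(cols)
    rng = np.random.default_rng(seed)
    out = data_frame.copy()
    # Rows where at least one of the coordinates is missing
    missing = out[cols].isna().any(axis=1)
    # Coordinates of the complete rows, kept together as pairs
    complete = out.loc[~missing, cols].to_numpy()
    # One random complete row per incomplete row
    picks = rng.integers(0, len(complete), size=int(missing.sum()))
    out.loc[missing, cols] = complete[picks]
    return out

# Toy example with made-up coordinates
toy = pd.DataFrame({"X": [1.0, 2.0, np.nan, 4.0, np.nan],
                    "Y": [10.0, 20.0, np.nan, 40.0, np.nan]})
filled = random_replacena_pairs(toy, seed=0)
print(filled)
```

Every filled pair is then a location where a collision was actually recorded.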


**Handling missing values of the address type (ADDRTYPE)**

In [29]:
df.ADDRTYPE.value_counts()

Block           144784
Intersection     71773
Alley              874
Name: ADDRTYPE, dtype: int64

Replace the missing values with random values selected according to <font color="blue"> roulette wheel selection </font>. To do this, define a function for calculating the probability distribution:

In [30]:
def find_probs(data_frame, col):
    """
    Find the probability distribution of data_frame.col
    """
    counts = data_frame[col].value_counts().values
    total = sum(counts)
    probs = [float(x)/total for x in counts]
    return probs


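def roulette_wheel_replacena(data_frame, col, show=False):
    """
    Replace the missing values of data_frame.col with values drawn at random, where each non-missing value
    is drawn with a probability proportional to its frequency. If show is True, print the distribution
    before and after the replacement.
    """
    probs = find_probs(data_frame, col)
    if show:
        print("Probability distribution BEFORE applying roulette wheel:")
        print(probs)
    # value_counts() returns values and counts in the same order, so values[i] has probability probs[i]
    values = data_frame[col].value_counts().index.values.tolist()
    # Replace missing values using roulette wheel
    data_frame[col] = data_frame[col].apply(lambda x: np.random.choice(values, p=probs) if (pd.isna(x)) else x)
    if show:
        print("")
        print("Probability distribution AFTER applying roulette wheel:")
        print(find_probs(data_frame, col)) 
        print("")
    # Check that no missing values remain
    if data_frame[col].isna().sum() != 0:
        print('Error!')
    else:
        print('Successful!')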
    

In [31]:
roulette_wheel_replacena(df, 'ADDRTYPE', show=True)

Probability distribution BEFORE applying roulette wheel:
[0.6658848094337974, 0.33009552455721586, 0.004019666008986759]

Probability distribution AFTER applying roulette wheel:
[0.6659175284770488, 0.3300714922018784, 0.00401097932107279]

Successful!


**Handling missing values of location (LOCATION)**  

This is a bit complicated.  
- The attribute `LOCATION` provides a description of the general location at which the collision occurred.  
- Most of the data entries are of the form: <font color="blue"> Street A BETWEEN Street B AND Street C </font>. 
- In reality, some streets are more prone to accidents than others. For instance, a busy main road in the city is more likely to have accidents than a quiet road in a quiet neighborhood. 
- Furthermore, the _specific location_ along the street is usually of secondary importance. This is consistent with the output of `value_counts`, in which no value is dominantly frequent.  
- However, when the streets are considered as a whole, _without partitioning the street_ as in the data, some streets appear to have significantly more accidents.

In [32]:
df.LOCATION.value_counts()

BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N    298
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N          296
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB    291
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                    282
6TH AVE AND JAMES ST                                              276
                                                                 ... 
24TH AVE NE AND NE 60TH ST                                          1
47TH AVE S BETWEEN S ADAMS ST AND S GENESEE ST                      1
47TH AVE S AND S CLOVERDALE ST                                      1
8TH AVE S BETWEEN S GARDEN ST AND S OTHELLO ST                      1
WALNUT AVE SW BETWEEN SW STEVENS ST AND SW WINTHROP ST              1
Name: LOCATION, Length: 25187, dtype: int64

Clearly, no location is dominantly frequent. Now, remove the street partition so that <font color="blue"> a street from its beginning to its end is considered as a _single entry_ </font>.

In [33]:
df['LOCATION'] = df['LOCATION'].apply(lambda x: (str(x).split('BETWEEN')[0]).strip() if "BETWEEN" in str(x) else x)
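
To see what this transformation does, here is a small sketch on a few `LOCATION` strings taken from the output above. It uses the pandas string accessor instead of `apply`, which is an alternative formulation and not the code used in this notebook; the names `locations` and `streets` are illustrative.

```python
import pandas as pd

# Sample entries copied from the value_counts() output, plus one missing value
locations = pd.Series([
    "BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N",
    "AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST",
    "6TH AVE AND JAMES ST",
    None,
])

# Keep only the part before " BETWEEN "; entries without it are left unchanged
streets = locations.str.split(" BETWEEN ", n=1).str[0].str.strip()
print(streets.tolist())
```

Intersections such as `6TH AVE AND JAMES ST` contain no `BETWEEN` and are kept whole, and missing entries stay missing so that they can still be imputed afterwards.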

In [34]:
df.LOCATION.value_counts()[:20]

RAINIER AVE S           5712
AURORA AVE N            3896
1ST AVE S               1600
LAKE CITY WAY NE        1553
1ST AVE                 1549
4TH AVE S               1413
DELRIDGE WAY SW         1383
EAST MARGINAL WAY S     1375
GREENWOOD AVE N         1251
NE 45TH ST              1175
CALIFORNIA AVE SW       1113
15TH AVE NW             1037
ROOSEVELT WAY NE        1037
35TH AVE SW             1020
M L KING JR ER WAY S     993
15TH AVE NE              980
DENNY WAY                925
E MADISON ST             913
4TH AVE                  869
M L KING JR WR WAY S     863
Name: LOCATION, dtype: int64

Eureka! Some streets appear to have significantly more accidents than others. Those streets are among the 10 busiest streets in Seattle according to [this website](https://mynorthwest.com/gallery/seattles-10-busiest-streets/). This pattern was not apparent before.

In [51]:
# Now replace the missing values
roulette_wheel_replacena(df, 'LOCATION')

Successful!


**Handling missing values of the collision type (COLLISIONTYPE)**

In [35]:
df.COLLISIONTYPE.value_counts()

Parked Car    48451
Angles        35460
Rear Ended    34622
Other         24524
Sideswipe     18853
Left Turn     14080
Pedestrian     7653
Cycles         5909
Right Turn     3007
Head On        2181
Name: COLLISIONTYPE, dtype: int64

In [36]:
roulette_wheel_replacena(df, 'COLLISIONTYPE', show=True)

Probability distribution BEFORE applying roulette wheel:
[0.24879839786381844, 0.18208893909828489, 0.17778576563623294, 0.12593201191332032, 0.0968111327924412, 0.07230153024545548, 0.03929855191537435, 0.03034302146451679, 0.015441100955119648, 0.011199548115435966]

Probability distribution AFTER applying roulette wheel:
[0.24859027868845046, 0.18253799577648852, 0.1777582831018843, 0.12624410449347254, 0.09651673351632202, 0.07194891992963828, 0.03916922534287769, 0.03042375295623194, 0.015537457663141045, 0.011273248531493196]

Successful!


**Handling missing values of junction type (JUNCTIONTYPE)**

In [37]:
df.JUNCTIONTYPE.value_counts()

Mid-Block (not related to intersection)              101523
At Intersection (intersection related)                69067
Mid-Block (but intersection related)                  24392
Driveway Junction                                     11493
At Intersection (but not related to intersection)      2495
Ramp Junction                                           190
Unknown                                                  21
Name: JUNCTIONTYPE, dtype: int64

Use roulette wheel to replace the missing values

In [38]:
roulette_wheel_replacena(df, 'JUNCTIONTYPE', show=True)

Probability distribution BEFORE applying roulette wheel:
[0.48533566624119784, 0.33017817105760083, 0.11660714883282898, 0.05494284853786912, 0.011927469512049373, 0.0009083042914987499, 0.00010039152695512498]

Probability distribution AFTER applying roulette wheel:
[0.4854325029505795, 0.3301574094590378, 0.11646310306001094, 0.05492825909027191, 0.012005806197799614, 0.0009179580633345844, 9.496117896564667e-05]

Successful!


**Handling missing values of whether the driver was under the influence of alcohol (UNDERINFL)**

In [39]:
df.UNDERINFL.value_counts()

N    103454
0     81676
Y      5399
1      4230
Name: UNDERINFL, dtype: int64

According to the data source, the attribute UNDERINFL is binary with two values: "Y" indicates that the driver was under the influence of alcohol, and "N" indicates otherwise. 
It appears that the data entries are not consistent. <font color="red"> We need to convert "Y" to 1 and "N" to 0. </font>

In [40]:
df['UNDERINFL'] = df['UNDERINFL'].replace(to_replace={"N": "0", "Y":"1"})
# Check 
df.UNDERINFL.value_counts()

0    185130
1      9629
Name: UNDERINFL, dtype: int64
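
The cell above unifies the labels but keeps them as strings. If a numeric flag is preferred, the four spellings can be mapped to integers in one step. This is a minimal sketch on made-up sample values; the nullable `Int8` dtype is my choice here so that missing entries survive until they are imputed.

```python
import pandas as pd

# Sample of the four spellings seen in UNDERINFL, plus one missing value
raw = pd.Series(["N", "0", "Y", "1", None])

# Map every spelling to a single integer code; anything unmapped becomes missing
mapping = {"N": 0, "0": 0, "Y": 1, "1": 1}
flag = raw.map(mapping).astype("Int8")
print(flag)
```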

In [50]:
roulette_wheel_replacena(df, 'UNDERINFL', show=True)

Probability distribution BEFORE applying roulette wheel:
[0.9505594093212637, 0.04944059067873628]

Probability distribution AFTER applying roulette wheel:
[0.95048452811077, 0.04951547188923005]

Successful!


**Handling missing values of weather (WEATHER)**

In [41]:
df.WEATHER.value_counts()

Clear                       114342
Raining                      34019
Overcast                     28503
Unknown                      15131
Snowing                        919
Other                          851
Fog/Smog/Smoke                 577
Sleet/Hail/Freezing Rain       116
Blowing Sand/Dirt               56
Severe Crosswind                26
Partly Cloudy                    9
Blowing Snow                     1
Name: WEATHER, dtype: int64

In [42]:
roulette_wheel_replacena(df, 'WEATHER', show=True)

Probability distribution BEFORE applying roulette wheel:
[0.5877255204317656, 0.17485993317913132, 0.14650732459521973, 0.07777435106656387, 0.004723721408378309, 0.00437419686455924, 0.0029658185556412234, 0.0005962477512207659, 0.00028784374196864557, 0.00013364173734258544, 4.6260601387818045e-05, 5.140066820868671e-06]

Probability distribution AFTER applying roulette wheel:
[0.5873168040589121, 0.1751852873480056, 0.14682807052450225, 0.0775787612540302, 0.00464857580841356, 0.004426999724160385, 0.0029483185088381727, 0.000574289034697006, 0.0003120153023156962, 0.00013113686619065491, 4.5219609031260315e-05, 4.521960903126032e-06]

Successful!


**Handling missing values of road condition (ROADCOND)**

In [43]:
df.ROADCOND.value_counts()

Dry               128150
Wet                48710
Unknown            15129
Ice                 1231
Snow/Slush          1014
Other                136
Standing Water       119
Sand/Mud/Dirt         77
Oil                   64
Name: ROADCOND, dtype: int64

In [44]:
roulette_wheel_replacena(df, 'ROADCOND', show=True)

Probability distribution BEFORE applying roulette wheel:
[0.658428813646406, 0.25026974258850126, 0.07773210707496275, 0.006324821456096182, 0.005209885423624313, 0.0006987617530699276, 0.0006114165339361866, 0.00039562246313517955, 0.00032882906026820123]

Probability distribution AFTER applying roulette wheel:
[0.658569342009469, 0.2501187014737071, 0.07757423929312707, 0.006466404091470225, 0.005236430725819944, 0.000723513744500165, 0.0006104647219220143, 0.0003753227549594606, 0.0003255811850250743]

Successful!


**Handling missing values of light condition (LIGHTCOND)**

In [45]:
df.LIGHTCOND.value_counts()

Daylight                    119148
Dark - Street Lights On      50048
Unknown                      13520
Dusk                          6074
Dawn                          2599
Dark - No Street Lights       1573
Dark - Street Lights Off      1236
Other                          244
Dark - Unknown Lighting         20
Name: LIGHTCOND, dtype: int64

We note that 6% of the accidents occurred in an <font color="red"> unknown light condition </font>. Intuitively, we believe that the light condition is related to the severity of the collision. Therefore, it is important to predict these <font color="red"> unknown </font> values. For this, we propose two solutions:

- First, we can impute the value of the unknowns based on a model trained on the available data.
- Second, we can impute the value of the unknowns using our conventional roulette wheel selection, which preserves the probability distribution of `LIGHTCOND`.

Due to its simplicity, we chose the second method. However, the first approach should be preferred in general, since it makes an informed decision based on the other features in the dataset.
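
A lightweight step toward the first approach, short of training a full model, is to run the roulette wheel separately within groups defined by another feature. The sketch below is only an illustration under assumptions: the function `conditional_roulette_fill` is hypothetical, the choice of `WEATHER` as the conditioning feature is arbitrary, and the toy rows are invented.

```python
import numpy as np
import pandas as pd

def conditional_roulette_fill(data_frame, target, given, seed=None):
    """
    Return a copy in which missing `target` values are drawn from the observed
    `target` values of rows that share the same `given` value.
    """
    rng = np.random.default_rng(seed)
    out = data_frame.copy()
    # Rows with a missing `given` value are skipped by groupby and stay unchanged
    for _, group in data_frame.groupby(given):
        known = group[target].dropna()
        gaps = group.index[group[target].isna()]
        # Drawing uniformly from the observed values (with repeats) is a
        # frequency-proportional draw; groups without observed values are left as-is
        if len(known) > 0 and len(gaps) > 0:
            out.loc[gaps, target] = rng.choice(known.to_numpy(), size=len(gaps))
    return out

# Invented toy rows, not taken from the collision data
toy = pd.DataFrame({
    "WEATHER": ["Clear"] * 4 + ["Raining"] * 4,
    "LIGHTCOND": ["Daylight", "Daylight", np.nan, "Daylight",
                  "Dark - Street Lights On", np.nan, "Dark - Street Lights On", np.nan],
})
filled = conditional_roulette_fill(toy, "LIGHTCOND", "WEATHER", seed=0)
print(filled)
```

Unlike the unconditional roulette wheel, this preserves the distribution of `LIGHTCOND` within each group rather than only overall.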

In [46]:
# First convert "Unknown" to nan
df['LIGHTCOND'] = df['LIGHTCOND'].replace(to_replace="Unknown", value=np.nan)
# Replace the missing values using the roulette wheel selection
roulette_wheel_replacena(df, 'LIGHTCOND', show=True)

Probability distribution BEFORE applying roulette wheel:
[0.6584872500580297, 0.2765969205601795, 0.03356876789247383, 0.014363718760707852, 0.008693393463098673, 0.006830918194780648, 0.001348498413856374, 0.00011053265687347326]

Probability distribution AFTER applying roulette wheel:
[0.6578367843431626, 0.27734542807142887, 0.033516774213970146, 0.014370791750134528, 0.008650511207680098, 0.006868858611848442, 0.001297802779197171, 0.00011304902257815078]

Successful!


<font color="blue"> **Great! All missing values have been handled!** </font>

Do one last check that the data is free of missing values

In [52]:
displayna(df, showall=True)

Column Name:  X
Missing Value?  False  Percentage:  100.0

Column Name:  Y
Missing Value?  False  Percentage:  100.0

Column Name:  STATUS
Missing Value?  False  Percentage:  100.0

Column Name:  ADDRTYPE
Missing Value?  False  Percentage:  100.0

Column Name:  LOCATION
Missing Value?  False  Percentage:  100.0

Column Name:  SEVERITYCODE
Missing Value?  False  Percentage:  100.0

Column Name:  COLLISIONTYPE
Missing Value?  False  Percentage:  100.0

Column Name:  PERSONCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  PEDCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  PEDCYLCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  VEHCOUNT
Missing Value?  False  Percentage:  100.0

Column Name:  INJURIES
Missing Value?  False  Percentage:  100.0

Column Name:  SERIOUSINJURIES
Missing Value?  False  Percentage:  100.0

Column Name:  FATALITIES
Missing Value?  False  Percentage:  100.0

Column Name:  JUNCTIONTYPE
Missing Value?  False  Percentage:  100.0

## Data Normalization

In this section, we normalize the data and transform it into a format that is more convenient for the machine learning algorithms.

In [53]:
# First let us check the data types of the features and correct it if necessary
df.dtypes

X                  float64
Y                  float64
STATUS              object
ADDRTYPE            object
LOCATION            object
SEVERITYCODE        object
COLLISIONTYPE       object
PERSONCOUNT          int64
PEDCOUNT             int64
PEDCYLCOUNT          int64
VEHCOUNT             int64
INJURIES             int64
SERIOUSINJURIES      int64
FATALITIES           int64
JUNCTIONTYPE        object
UNDERINFL           object
WEATHER             object
ROADCOND            object
LIGHTCOND           object
SEGLANEKEY           int64
CROSSWALKKEY         int64
dtype: object
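
The `object` columns above hold categorical text, and the count columns are on different scales. As a rough sketch of what the conversion could look like, the example below casts a text column to `category`, rescales a count column to the range [0, 1], and one-hot encodes the categorical column. The toy values and the choice of min-max scaling are assumptions for illustration, not the steps settled on in this project.

```python
import pandas as pd

# Toy frame with one categorical and one count column (made-up values)
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast"],
    "PERSONCOUNT": [2, 3, 2, 5],
})

# Store the text column as a categorical type
toy["WEATHER"] = toy["WEATHER"].astype("category")

# Min-max scaling: map the smallest count to 0 and the largest to 1
counts = toy["PERSONCOUNT"]
toy["PERSONCOUNT_SCALED"] = (counts - counts.min()) / (counts.max() - counts.min())

# One-hot encode the categorical column
encoded = pd.get_dummies(toy, columns=["WEATHER"])
print(encoded)
```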

# 3. Methodology <a class="anchor" id="method"></a>
