# Data Cleaning / Pre-processing

In [6]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# About the Data

- The data we're going to use can be accessed here: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
- It covers motor vehicle collisions; each row represents one crash event. The Motor Vehicle Collisions data contains information from all police-reported motor vehicle collisions in NYC. The copy loaded in this notebook has about 1.8 million rows (1,838,945) and 29 columns.

# Data Cleaning / Processing

- Because the dataset is large, we take a random sample of 50,000 rows to keep this task fast and manageable.
- Several features are not needed for our problem.
- We drop those features, since they have no effect on our output.
- Finally, we deal with missing values. Some features contain a lot of missing values (including our target feature), so we have to handle them. We start with the target feature: because it is categorical and the dataset is large, we simply drop the rows where the target is missing.
- After that, we impute the missing values in the rest of our features. Assuming the numeric features contain no extreme outliers, we fill their missing values with the column mean. Then we fix the datatype of each column.
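
The plan above can be sketched on a small toy frame. This is a minimal illustration, not the notebook's actual code: the helper `clean_sample`, the column names `target`, `injured`, and `unused`, and the choice to mean-fill only numeric columns are all assumptions made for the example. The mean is only defined for numeric columns, so categorical features would need a different strategy, such as the most frequent value.

```python
import numpy as np
import pandas as pd


def clean_sample(df, target_col, drop_cols):
    """Hypothetical helper: drop unused columns, drop rows with a
    missing target, then mean-fill the remaining numeric columns."""
    # Drop features that are not needed (ignore names that are absent).
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Drop rows where the target itself is missing.
    out = out.dropna(subset=[target_col])
    # fillna with a Series fills each column by matching column name,
    # so non-numeric columns are left untouched.
    return out.fillna(out.mean(numeric_only=True))


# Toy data standing in for the collisions table (names are illustrative).
toy = pd.DataFrame({
    "target": ["a", None, "b", "a"],
    "injured": [1.0, 2.0, np.nan, 3.0],
    "unused": ["x", "y", "z", "w"],
})

cleaned = clean_sample(toy, target_col="target", drop_cols=["unused"])
print(cleaned)
```

Dropping the rows with a missing target before computing the means keeps the fill values consistent with the rows that are actually retained.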

In [7]:
#importing our dataset
df = pd.read_csv("Motor_Vehicle_Collisions_-_Crashes.csv")

In [8]:
df.head()

| | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 04/14/2021 | 5:32 | | | | | | BRONX WHITESTONE BRIDGE | | | ... | Unspecified | | | | 4407480 | Sedan | Sedan | | | |
| 1 | 04/13/2021 | 21:35 | BROOKLYN | 11217.0 | 40.68358 | -73.97617 | (40.68358, -73.97617) | | | 620 ATLANTIC AVENUE | ... | | | | | 4407147 | Sedan | | | | |
| 2 | 04/15/2021 | 16:15 | | | | | | HUTCHINSON RIVER PARKWAY | | | ... | | | | | 4407665 | Station Wagon/Sport Utility Vehicle | | | | |
| 3 | 04/13/2021 | 16:00 | BROOKLYN | 11222.0 | | | | VANDERVORT AVENUE | ANTHONY STREET | | ... | Unspecified | | | | 4407811 | Sedan | | | | |
| 4 | 04/12/2021 | 8:25 | | | 0.0 | 0.0 | (0.0, 0.0) | EDSON AVENUE | | | ... | Unspecified | | | | 4406885 | Station Wagon/Sport Utility Vehicle | Sedan | | | |


In [9]:
df.shape

(1838945, 29)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1838945 entries, 0 to 1838944
Data columns (total 29 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   CRASH DATE                     object 
 1   CRASH TIME                     object 
 2   BOROUGH                        object 
 3   ZIP CODE                       object 
 4   LATITUDE                       float64
 5   LONGITUDE                      float64
 6   LOCATION                       object 
 7   ON STREET NAME                 object 
 8   CROSS STREET NAME              object 
 9   OFF STREET NAME                object 
 10  NUMBER OF PERSONS INJURED      float64
 11  NUMBER OF PERSONS KILLED       float64
 12  NUMBER OF PEDESTRIANS INJURED  int64  
 13  NUMBER OF PEDESTRIANS KILLED   int64  
 14  NUMBER OF CYCLIST INJURED      int64  
 15  NUMBER OF CYCLIST KILLED       int64  
 16  NUMBER OF MOTORIST INJURED     int64  
 17  NUMBER OF MOTORIST KILLED      int64  
 18  CO
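
The `df.info()` listing above shows dtypes but no non-null counts. pandas omits them by default for very large frames, and recent versions accept `df.info(show_counts=True)` to force them. Another way to quantify missingness is to compute the share of missing values per column. The sketch below uses a toy frame with a few of the real column names; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions table (values are illustrative).
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", None, None, "BROOKLYN"],
    "LATITUDE": [40.68358, np.nan, np.nan, np.nan],
    "COLLISION_ID": [4407147, 4407480, 4407665, 4407811],
})

# isna() marks missing cells; the column-wise mean of that boolean
# mask is the fraction of missing values in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same expression applied to `df` ranks the columns by how much imputation or dropping they would need.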

In [11]:
#Taking 50,000 random samples from our data.
df = df.sample(n=50000)

In [12]:
#Checking the shape of our data
df.shape

(50000, 29)
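
`df.sample(n=50000)` draws a different subset on every run, so the statistics that follow will vary between executions. If reproducibility matters, `DataFrame.sample` accepts a `random_state` seed. The sketch below shows the effect on a toy frame; the seed value 42 is an arbitrary choice, not one used in the original notebook.

```python
import pandas as pd

# Toy frame standing in for the full dataset.
toy = pd.DataFrame({"COLLISION_ID": range(1000)})

# The same seed yields the same rows in the same order.
sample_a = toy.sample(n=50, random_state=42)
sample_b = toy.sample(n=50, random_state=42)

print(sample_a.equals(sample_b))
```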

In [13]:
#Statistical summary of our data
df.describe()

| | LATITUDE | LONGITUDE | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | COLLISION_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 44281.0 | 44281.0 | 50000.0 | 49999.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 40.65931 | -73.801483 | 0.28238 | 0.00108 | 0.05302 | 0.00064 | 0.02414 | 0.0001 | 0.2041 | 0.0003 | 2982882.0 |
| std | 1.631298 | 2.958838 | 0.680037 | 0.03345 | 0.236241 | 0.02529 | 0.155041 | 0.01 | 0.641725 | 0.017318 | 1498635.0 |
| min | 0.0 | -74.742 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 26.0 |
| 25% | 40.66925 | -73.975882 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2998134.0 |
| 50% | 40.721676 | -73.928329 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3557566.0 |
| 75% | 40.769524 | -73.866598 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4018561.0 |
| max | 40.912884 | 0.0 | 14.0 | 2.0 | 6.0 | 1.0 | 2.0 | 1.0 | 14.0 | 1.0 | 4475822.0 |
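
The summary above reports a LATITUDE minimum of 0.0 and a LONGITUDE maximum of 0.0, far outside the range spanned by the quartiles, and the LATITUDE mean (40.659) falls below the 25% quartile (40.669) as a result. These look like placeholder coordinates, such as the `(0.0, 0.0)` location in the `df.head()` output, rather than real positions. One possible treatment, shown here as an assumption rather than the notebook's method, is to convert them to missing values before any mean imputation. The toy values below are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: one real coordinate pair, one (0, 0) placeholder, one missing.
toy = pd.DataFrame({
    "LATITUDE": [40.68358, 0.0, np.nan],
    "LONGITUDE": [-73.97617, 0.0, np.nan],
})

coords = ["LATITUDE", "LONGITUDE"]

# Flag rows where both coordinates are exactly 0.0 and mark them as
# missing, so they no longer pull the column mean toward zero.
is_placeholder = (toy[coords] == 0.0).all(axis=1)
toy.loc[is_placeholder, coords] = np.nan

print(toy)
```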
