# Data Cleaning / Pre-processing

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# About Data 

- The data we're going to use can be accessed here: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
- It covers motor vehicle collisions, and each row represents one crash event. The Motor Vehicle Collisions data contains information from all police-reported motor vehicle collisions in NYC. It has about 600,000 rows and 29 columns.

# Data Cleaning / Processing

- As the dataset is large, we're going to take a random sample of 50,000 rows from it to keep this task fast.
- There are also a number of features that we don't need for our problem. We drop those features since they have no effect on our output.
- Next comes the part of dealing with missing values. Some of the features contain a lot of missing values, including our target feature. Since the target is a categorical feature and our dataset is large, we're simply going to drop the rows where the target is missing.
- After that, we're going to impute missing values for the rest of our features. Since the mean and median of the coordinates are close, which suggests there are no strong outliers, we will fill the missing latitude and longitude values using the mean strategy. Rows that still have a missing value in any other column are dropped.
- Finally, we will fix the datatype of each column.
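
The steps listed above can be sketched end to end on a tiny, made-up DataFrame. This is only an illustration under stated assumptions: the helper `clean`, its parameters, and the toy values (including the `ZIP CODE` column used as an example of an unneeded feature) are hypothetical and are not the code used in the cells of this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions data; all values are made up for illustration.
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", None, "QUEENS", "BRONX"],
    "LATITUDE": [40.6, 40.7, np.nan, 40.8],
    "LONGITUDE": [-73.9, -74.0, -73.8, np.nan],
    "ZIP CODE": ["11201", "10001", None, "10451"],  # example of a column we do not need
})

def clean(frame, keep, target, numeric):
    """Keep selected columns, drop rows without a target, mean-fill numeric gaps."""
    # Select the required columns and drop rows whose target value is missing.
    out = frame[keep].dropna(subset=[target]).copy()
    # Fill the remaining gaps in each numeric column with that column's mean.
    for col in numeric:
        out[col] = out[col].fillna(out[col].mean())
    # Make the numeric dtypes explicit.
    return out.astype({col: "float64" for col in numeric})

cleaned = clean(
    toy,
    keep=["BOROUGH", "LATITUDE", "LONGITUDE"],
    target="BOROUGH",
    numeric=["LATITUDE", "LONGITUDE"],
)
print(cleaned)
```

Note that the means are computed after the rows with a missing target are dropped, so the fill values reflect only the rows that are kept.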

In [None]:
#importing our dataset
df = pd.read_csv("Motor_Vehicle_Collisions_-_Crashes.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
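#Taking 50,000 random samples from our data.
#random_state fixes the seed so the same rows are drawn on every run (42 is an arbitrary choice).
df = df.sample(n=50000, random_state=42)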

In [None]:
#Checking the shape of our data
df.shape

In [None]:
#Statistical summary of our data
df.describe()

In [None]:
#Checking for missing values
df.isnull().sum()

In [None]:
#Listing the required features
columns = ["CRASH DATE", "CRASH TIME", "BOROUGH", "LATITUDE", "LONGITUDE", "NUMBER OF PERSONS INJURED",
           "NUMBER OF PERSONS KILLED", "NUMBER OF PEDESTRIANS INJURED", "NUMBER OF PEDESTRIANS KILLED",
           "NUMBER OF CYCLIST INJURED", "NUMBER OF CYCLIST KILLED", "NUMBER OF MOTORIST INJURED",
           "NUMBER OF MOTORIST KILLED"]

In [None]:
#Keeping only the required features in our dataset
df = df[columns]

In [None]:
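#Dropping rows where the target feature (BOROUGH) is missing
#Reassigning instead of using inplace=True avoids modifying a slice of the original frame
df = df.dropna(subset=["BOROUGH"])
df.shape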

In [None]:
#Checking for outliers by comparing the mean and median of each coordinate
for col in ["LATITUDE", "LONGITUDE"]:
    print(f"Mean of {col.title()}: ", df[col].mean())
    print(f"Median of {col.title()}: ", df[col].median())
print("\nIf the mean and median are approx. the same, strong outliers are unlikely in that column.")
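
Comparing the mean and the median is a coarse check, because a few extreme values in a large sample barely move the mean. A complementary sketch is the common 1.5 × IQR rule shown here. The helper `iqr_outliers` and the toy latitudes (including the placeholder value `0.0`) are assumptions made for illustration, not values taken from the dataset.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up latitudes with one placeholder value that is clearly out of range.
lat = pd.Series([40.60, 40.65, 40.70, 40.72, 40.75, 40.80, 0.0], name="LATITUDE")
mask = iqr_outliers(lat)
print(lat[mask])
```

Missing values compare as False on both sides, so they are never flagged by this mask and can be handled separately.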

In [None]:
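#Filling missing coordinates with the column mean
#Assigning the result back is more reliable than calling fillna(inplace=True) on a selected column
df["LATITUDE"] = df["LATITUDE"].fillna(df["LATITUDE"].mean())
df["LONGITUDE"] = df["LONGITUDE"].fillna(df["LONGITUDE"].mean())
#Dropping rows that still have a missing value in any other column
df = df.dropna()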

In [None]:
#Checking again for missing values
df.isnull().sum()

In [None]:
#Fixing the datatype
df["LATITUDE"] = df["LATITUDE"].astype("float64")
df["LONGITUDE"] = df["LONGITUDE"].astype("float64")
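
Fixing datatypes can also cover the date and time columns, which are read from the CSV as plain strings. The sketch below combines them into a single datetime column on a toy frame. The string formats (month/day/year and 24-hour hour:minute), the sample values, and the new column name `CRASH DATETIME` are assumptions, so check them against `df.head()` before applying this to the real data.

```python
import pandas as pd

# Toy rows; the values and their formats are assumptions for illustration.
toy = pd.DataFrame({
    "CRASH DATE": ["09/11/2021", "03/26/2022"],
    "CRASH TIME": ["2:39", "11:45"],
})

# Join the two string columns and parse them with an explicit format.
toy["CRASH DATETIME"] = pd.to_datetime(
    toy["CRASH DATE"] + " " + toy["CRASH TIME"],
    format="%m/%d/%Y %H:%M",
)
print(toy.dtypes)
```

With a real datetime column, components such as the hour or weekday are available through the `.dt` accessor, for example `toy["CRASH DATETIME"].dt.hour`.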

In [None]:
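#Saving our cleaned data in a csv file
#index=False leaves out the sampled row index, which would otherwise be written as an extra unnamed column
df.to_csv("cleaned_data.csv", index=False)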