In this project, I'll learn how to handle missing data without dropping rows or columns, using motor vehicle collision data released by New York City and published on the <b>NYC OpenData website</b>.
There is data on over 1.5 million collisions dating back to 2012, with additional data continuously added.

I'll work with an extract of the full data: crashes from 2018.

The data has been modified from the original to fit this scenario.

In [20]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 900)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in recent pandas

mvc = pd.read_csv(r"C:\Users\lumum\Documents\Data Projects\working-with-missing-data\nypd_mvc_2018.csv")

mvc.head(50)

Unnamed: 0,unique_key,date,time,borough,location,on_street,cross_street,off_street,pedestrians_injured,cyclist_injured,motorist_injured,total_injured,pedestrians_killed,cyclist_killed,motorist_killed,total_killed,vehicle_1,vehicle_2,vehicle_3,vehicle_4,vehicle_5,cause_vehicle_1,cause_vehicle_2,cause_vehicle_3,cause_vehicle_4,cause_vehicle_5
0,3869058,2018-03-23,21:40,MANHATTAN,"(40.742832, -74.00771)",WEST 15 STREET,10 AVENUE,,0,0,0,0.0,0,0,0,0.0,PASSENGER VEHICLE,,,,,Following Too Closely,Unspecified,,,
1,3847947,2018-02-13,14:45,BROOKLYN,"(40.623714, -73.99314)",16 AVENUE,62 STREET,,0,0,0,0.0,0,0,0,0.0,SPORT UTILITY / STATION WAGON,DS,,,,Backing Unsafely,Unspecified,,,
2,3914294,2018-06-04,0:00,,"(40.591755, -73.9083)",BELT PARKWAY,,,0,0,1,1.0,0,0,0,0.0,Station Wagon/Sport Utility Vehicle,Sedan,,,,Following Too Closely,Unspecified,,,
3,3915069,2018-06-05,6:36,QUEENS,"(40.73602, -73.87954)",GRAND AVENUE,VANLOON STREET,,0,0,0,0.0,0,0,0,0.0,Sedan,Sedan,,,,Glare,Passing Too Closely,,,
4,3923123,2018-06-16,15:45,BRONX,"(40.884727, -73.89945)",,,208 WEST 238 STREET,0,0,0,0.0,0,0,0,0.0,Station Wagon/Sport Utility Vehicle,Sedan,,,,Turning Improperly,Unspecified,,,
5,3987177,2018-09-14,11:50,,"(40.785984, -73.95718)",EAST 93 STREET,,,0,0,0,0.0,0,0,0,0.0,Station Wagon/Sport Utility Vehicle,Box Truck,,,,Driver Inattention/Distraction,Passing Too Closely,,,
6,4008417,2018-10-19,11:00,QUEENS,"(40.731968, -73.923225)",54 AVENUE,44 STREET,,0,0,0,0.0,0,0,0,0.0,Sedan,Sedan,,,,Unspecified,Unspecified,,,
7,3917518,2018-06-05,10:00,,"(40.660114, -74.00191)",3 AVENUE,,,0,0,0,0.0,0,0,0,0.0,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,,Unspecified,Unspecified,,,
8,3953286,2018-08-03,22:30,QUEENS,"(40.666393, -73.75177)",NORTH CONDUIT AVENUE,225 STREET,,0,0,3,3.0,0,0,0,0.0,Sedan,Station Wagon/Sport Utility Vehicle,,,,Unspecified,Unspecified,,,
9,3896388,2018-05-08,8:40,QUEENS,"(40.715763, -73.737755)",218 STREET,99 AVENUE,,0,0,0,0.0,0,0,0,0.0,SPORT UTILITY / STATION WAGON,,,,,Unspecified,,,,


A summary of the columns and their data is below:
<ul>
    <li><b>unique_key</b>: A unique identifier for each collision.</li>
    <li><b>date, time</b>: Date and time of the collision.</li>
    <li><b>borough</b>: The borough, or area of New York City, where the collision occurred.</li>
    <li><b>location</b>: Latitude and longitude coordinates for the collision.</li>
    <li><b>on_street, cross_street, off_street</b>: Details of the street or intersection where the collision occurred.</li>
    <li><b>pedestrians_injured</b>: Number of pedestrians who were injured.</li>
    <li><b>cyclist_injured</b>: Number of people traveling on a bicycle who were injured.</li>
    <li><b>motorist_injured</b>: Number of people traveling in a vehicle who were injured.</li>
    <li><b>total_injured</b>: Total number of people injured.</li>
    <li><b>pedestrians_killed</b>: Number of pedestrians who were killed.</li>
    <li><b>cyclist_killed</b>: Number of people traveling on a bicycle who were killed.</li>
    <li><b>motorist_killed</b>: Number of people traveling in a vehicle who were killed.</li>
    <li><b>total_killed</b>: Total number of people killed.</li>
    <li><b>vehicle_1 through vehicle_5</b>: Type of each vehicle involved in the collision.</li>
    <li><b>cause_vehicle_1 through cause_vehicle_5</b>: Contributing factor for each vehicle in the collision.</li>
    </ul>

Counting missing values:

First, let's create a small dataframe that contains randomly placed null values:

In [24]:
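# each cell is 1.0 with probability 0.3 and NaN with probability 0.7 (unseeded, so values change on each run)
data = np.random.choice([1.0, np.nan], size=(3, 3), p=[.3, .7])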
df = pd.DataFrame(data, columns=['A','B','C'])
print(df)


     A    B    C
0  NaN  NaN  NaN
1  1.0  1.0  1.0
2  1.0  NaN  NaN


Next, we can use the <b>DataFrame.isnull()</b> method to identify which values are null:

In [26]:
df.isnull()

Unnamed: 0,A,B,C
0,True,True,True
1,False,False,False
2,False,True,True


Then we can chain the result to the <b>DataFrame.sum()</b> method to count the number of null values in each column:

In [29]:
df.isnull().sum()
# number of null values in each column

A    1
B    2
C    2
dtype: int64
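
The chain works because <b>DataFrame.sum()</b> treats each True as 1 and each False as 0. The sketch below is only an illustration of that idea: it hard-codes the values from the printed example (rather than using random data) so the result is reproducible, and the names mask, per_column, per_row and total are my own.

```python
import numpy as np
import pandas as pd

# Fixed values copied from the printed example, so the counts are reproducible
df = pd.DataFrame({
    "A": [np.nan, 1.0, 1.0],
    "B": [np.nan, 1.0, np.nan],
    "C": [np.nan, 1.0, np.nan],
})

mask = df.isnull()               # boolean dataframe: True where a value is missing
per_column = mask.sum()          # True counts as 1, so this is the null count per column
per_row = mask.sum(axis=1)       # same idea, summed across each row instead
total = int(mask.sum().sum())    # null count for the whole dataframe

print(per_column)
print(per_row)
print(total)
```

Passing axis=1 gives the count per row, and summing the per-column counts a second time gives a single total for the dataframe.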

Let's use this technique to count the null values in our data set.

In [31]:
import pandas as pd
mvc = pd.read_csv("nypd_mvc_2018.csv")

null_counts = mvc.isnull().sum()
print(null_counts)

unique_key             0    
date                   0    
time                   0    
borough                20646
location               3885 
on_street              13961
cross_street           29249
off_street             44093
pedestrians_injured    0    
cyclist_injured        0    
motorist_injured       0    
total_injured          1    
pedestrians_killed     0    
cyclist_killed         0    
motorist_killed        0    
total_killed           5    
vehicle_1              355  
vehicle_2              12262
vehicle_3              54352
vehicle_4              57158
vehicle_5              57681
cause_vehicle_1        175  
cause_vehicle_2        8692 
cause_vehicle_3        54134
cause_vehicle_4        57111
cause_vehicle_5        57671
dtype: int64
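
Raw counts are easier to interpret next to the share of rows they represent. As a rough sketch of one way to do that, the hypothetical helper null_summary below (not part of the original notebook) puts the counts and percentages side by side. It is shown on a small made-up dataframe that stands in for mvc, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

def null_summary(frame):
    """Return the null count and null percentage for each column (hypothetical helper)."""
    counts = frame.isnull().sum()
    pct = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"null_counts": counts, "null_pct": pct})

# Made-up stand-in data; the column names mirror mvc but the values are invented
toy = pd.DataFrame({
    "unique_key": [1, 2, 3, 4],
    "borough": ["QUEENS", None, "BRONX", None],
    "total_injured": [0.0, 1.0, np.nan, 0.0],
})

summary = null_summary(toy)
print(summary)
```

Assuming mvc is loaded as above, calling null_summary(mvc) should give the same kind of table for the collision data.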