In [39]:
import pandas as pd
import numpy as np

In [40]:
path = '../../pandas-workout-data/data/nyc-parking-violations-2020.csv'
columns = ['Plate ID', 'Registration State', 'Vehicle Make', 'Vehicle Color', 'Violation Time', 'Street Name']

In [41]:
df = pd.read_csv(filepath_or_buffer=path, usecols=columns)
df

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,0523P,43 ST,BK
1,KRE6058,PA,ME/BE,0428P,UNION ST,BLK
2,444326R,NJ,LEXUS,0625A,CLERMONT AVENUE,BLACK
3,F728330,OH,CHEVR,1106A,DIVISION AVE,
4,FMY9090,NY,JEEP,1253A,GRAND ST,GREY
...,...,...,...,...,...,...
12495729,62161MM,NY,FORD,1111A,3RD AVE,BR
12495730,GYE7330,NY,HONDA,0444P,PELHAM PARK DR,BLK
12495731,HNY4802,NY,FORD,0210A,LYDIG AVE,GY
12495732,T687081C,NY,TOYOT,0225P,E 68 STREET,BLK


How many rows are in the data frame when it is read into memory?

In [42]:
df.size  # total number of cells (rows * columns), not the number of rows

74974404

In [43]:
df.shape

(12495734, 6)
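
Note that `df.size` and `df.shape` answer different questions: `size` counts cells, so 74,974,404 is 12,495,734 rows times 6 columns. A minimal sketch on a hypothetical toy frame (the name `toy` and its values are made up for illustration) shows the relationship:

```python
import pandas as pd

# Hypothetical toy frame: 4 rows, 3 columns.
toy = pd.DataFrame({
    'Plate ID': ['J58JKX', 'KRE6058', '444326R', 'F728330'],
    'Registration State': ['NJ', 'PA', 'NJ', 'OH'],
    'Vehicle Make': ['HONDA', 'ME/BE', 'LEXUS', 'CHEVR'],
})

n_rows, n_cols = toy.shape

print(toy.size)         # number of cells
print(n_rows * n_cols)  # the same number, computed from shape
print(n_rows)           # number of rows, which is what the question asks for
```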

In [44]:
df.shape[0]

12495734

In [45]:
len(df) # This turns out to be a better way

12495734

The author says: Not only does this give the same answer, but in my testing, I found that len was twice as fast as shape[0]. But we can do even better by running len on df.index:

In [46]:
len(df.index)

12495734
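
The speed claim can be checked with a rough sketch like the following, which uses the standard-library `timeit` module on a hypothetical stand-in frame called `toy`. The absolute timings, and possibly even the ranking, will vary with the machine and the pandas version, so treat the numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Small stand-in frame; the real data set has about 12.5 million rows.
toy = pd.DataFrame({'a': np.arange(1_000), 'b': np.arange(1_000)})

# All three expressions must agree on the row count.
row_counts = {
    'shape[0]': toy.shape[0],
    'len(df)': len(toy),
    'len(df.index)': len(toy.index),
}

# Time 10,000 calls of each expression.
timings = {
    'shape[0]': timeit.timeit(lambda: toy.shape[0], number=10_000),
    'len(df)': timeit.timeit(lambda: len(toy), number=10_000),
    'len(df.index)': timeit.timeit(lambda: len(toy.index), number=10_000),
}

for name, seconds in timings.items():
    print(f'{name:15} rows={row_counts[name]}  {seconds:.4f}s per 10,000 calls')
```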

Remove rows with any missing data (i.e., a NaN value). How many rows remain after doing this pruning? If each parking ticket brings $100 into the city, and missing data means the ticket can be successfully contested, how much money may New York City lose due to such missing data?

In [47]:
all_good_df = df.dropna()

In [48]:
all_good_df

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,0523P,43 ST,BK
1,KRE6058,PA,ME/BE,0428P,UNION ST,BLK
2,444326R,NJ,LEXUS,0625A,CLERMONT AVENUE,BLACK
4,FMY9090,NY,JEEP,1253A,GRAND ST,GREY
5,KDG0693,PA,HYUN,0525P,B 99 ST,GY
...,...,...,...,...,...,...
12495729,62161MM,NY,FORD,1111A,3RD AVE,BR
12495730,GYE7330,NY,HONDA,0444P,PELHAM PARK DR,BLK
12495731,HNY4802,NY,FORD,0210A,LYDIG AVE,GY
12495732,T687081C,NY,TOYOT,0225P,E 68 STREET,BLK


How many rows were deleted by dropna()?

In [49]:
len(df.index) - len(all_good_df.index)

447359

That represents about 3.6% of the data in the original data frame, which doesn’t sound like much until we consider the next question: how much money New York City would lose if all those tickets were thrown out. Assuming that each parking ticket costs $100, we can calculate the total as follows:

In [50]:
f'${(len(df.index) - len(all_good_df.index) ) * 100:,}'

'$44,735,900'
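
The arithmetic behind both figures can be reproduced without loading the data, using only the row counts printed above. The variable names here are illustrative.

```python
total_rows = 12_495_734  # len(df.index), from the output above
removed_rows = 447_359   # rows dropped by dropna(), from the output above
fine = 100               # dollars per ticket, as assumed in the exercise

share = removed_rows / total_rows
lost = removed_rows * fine

print(f'{share:.2%} of rows removed')          # 3.58% of rows removed
print(f'${lost:,} in potential lost revenue')  # $44,735,900 in potential lost revenue
```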

Let’s instead assume that a ticket can only be dismissed if the **license plate, state, car make, and/or street name are missing**. Remove rows that are missing one or more of these. How many rows remain? Assuming $100/ticket, how much money would the city lose as a result of this missing data?

In [51]:
df.columns

Index(['Plate ID', 'Registration State', 'Vehicle Make', 'Violation Time',
       'Street Name', 'Vehicle Color'],
      dtype='object')

In [52]:
df['Plate ID'].notnull()

0           True
1           True
2           True
3           True
4           True
            ... 
12495729    True
12495730    True
12495731    True
12495732    True
12495733    True
Name: Plate ID, Length: 12495734, dtype: bool

In [53]:
df['Registration State'].notnull()

0           True
1           True
2           True
3           True
4           True
            ... 
12495729    True
12495730    True
12495731    True
12495732    True
12495733    True
Name: Registration State, Length: 12495734, dtype: bool

In [54]:
df['Vehicle Make'].notnull()

0           True
1           True
2           True
3           True
4           True
            ... 
12495729    True
12495730    True
12495731    True
12495732    True
12495733    True
Name: Vehicle Make, Length: 12495734, dtype: bool

In [55]:
df['Street Name'].notnull()

0           True
1           True
2           True
3           True
4           True
            ... 
12495729    True
12495730    True
12495731    True
12495732    True
12495733    True
Name: Street Name, Length: 12495734, dtype: bool

In [None]:
df[
    df['Plate ID'].notnull() 
    & df['Registration State'].notnull()
    & df['Vehicle Make'].notnull()
    & df['Street Name'].notnull()
   ]

This works. But there’s a better way to do things, using dropna. Normally, as we just saw, dropna removes rows that contain any NaN values. But we can tell it to look in only a subset of the columns, ignoring NaN values in any other columns. The result is a much cleaner query:

In [56]:
semi_good_df = df.dropna(subset=['Plate ID',
                                 'Registration State',
                                 'Vehicle Make',
                                 'Street Name'])
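
To see that the chained `notnull()` mask and `dropna(subset=...)` select the same rows, here is a small sketch on a hypothetical toy frame. The names `toy`, `required`, `via_mask`, and `via_dropna` are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN in each column, on different rows.
toy = pd.DataFrame({
    'Plate ID': ['ABC123', np.nan, 'XYZ789', 'DEF456'],
    'Registration State': ['NY', 'NJ', np.nan, 'PA'],
    'Vehicle Color': ['BLK', 'GY', 'WH', np.nan],
})
required = ['Plate ID', 'Registration State']

# Boolean mask: keep rows where every required column is non-null.
via_mask = toy[toy[required].notnull().all(axis='columns')]

# dropna restricted to the same columns; Vehicle Color is ignored.
via_dropna = toy.dropna(subset=required)

print(via_mask.equals(via_dropna))
print(via_dropna)
```

Row 3 survives in both results even though its Vehicle Color is missing, because that column is not in `required`.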

In [57]:
semi_good_df

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,0523P,43 ST,BK
1,KRE6058,PA,ME/BE,0428P,UNION ST,BLK
2,444326R,NJ,LEXUS,0625A,CLERMONT AVENUE,BLACK
3,F728330,OH,CHEVR,1106A,DIVISION AVE,
4,FMY9090,NY,JEEP,1253A,GRAND ST,GREY
...,...,...,...,...,...,...
12495729,62161MM,NY,FORD,1111A,3RD AVE,BR
12495730,GYE7330,NY,HONDA,0444P,PELHAM PARK DR,BLK
12495731,HNY4802,NY,FORD,0210A,LYDIG AVE,GY
12495732,T687081C,NY,TOYOT,0225P,E 68 STREET,BLK


In [58]:
f'${(len(df.index) - len(semi_good_df.index)) * 100:,}'

'$6,378,500'

According to this calculation, the result is $6,378,500. Still a fair amount of money, but a far cry from what we would have lost had we removed any and all problematic records.
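
Because the same subtraction and multiplication recur for each set of rules, they could be wrapped in a small helper. This is only a sketch: `lost_revenue` is a hypothetical function name, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd


def lost_revenue(df, subset=None, fine=100):
    """Dollars lost if every row with a NaN in `subset` is dismissed.

    With subset=None, a NaN in any column counts, as with plain dropna().
    """
    removed = len(df.index) - len(df.dropna(subset=subset).index)
    return removed * fine


# Hypothetical toy frame with one NaN per column, on different rows.
toy = pd.DataFrame({
    'Plate ID': ['ABC123', np.nan, 'XYZ789', 'DEF456'],
    'Vehicle Make': ['HONDA', 'FORD', np.nan, 'JEEP'],
    'Vehicle Color': ['BLK', 'GY', 'WH', np.nan],
})

strict = lost_revenue(toy)
medium = lost_revenue(toy, subset=['Plate ID', 'Vehicle Make'])
loose = lost_revenue(toy, subset=['Plate ID'])

print(f'${strict:,}', f'${medium:,}', f'${loose:,}')
```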

But let’s make the rules looser still, mandating only that three of the columns lack NaN values: Plate ID, Registration State, and Street Name. Once again, we can use df.dropna along with its subset parameter to remove only those rows that are missing a value in at least one of these three columns:

In [59]:
loosest_df = df.dropna(subset=[
    'Plate ID',
    'Registration State',
    'Street Name'
])
loosest_df

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,0523P,43 ST,BK
1,KRE6058,PA,ME/BE,0428P,UNION ST,BLK
2,444326R,NJ,LEXUS,0625A,CLERMONT AVENUE,BLACK
3,F728330,OH,CHEVR,1106A,DIVISION AVE,
4,FMY9090,NY,JEEP,1253A,GRAND ST,GREY
...,...,...,...,...,...,...
12495729,62161MM,NY,FORD,1111A,3RD AVE,BR
12495730,GYE7330,NY,HONDA,0444P,PELHAM PARK DR,BLK
12495731,HNY4802,NY,FORD,0210A,LYDIG AVE,GY
12495732,T687081C,NY,TOYOT,0225P,E 68 STREET,BLK


This removes only 1,618 rows from our original data frame. How much money would that translate into?

In [60]:
f'${(len(df.index) - len(loosest_df.index)) * 100:,}'

'$161,800'

According to this calculation, it works out to $161,800, which seems like a far more reasonable amount of lost revenue.
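
It is worth being precise about what `subset` does here: by default dropna uses `how='any'`, so a row is removed when at least one of the listed columns is NaN. Passing `how='all'` removes a row only when every listed column is NaN. A minimal sketch on a hypothetical toy frame (names and values are illustrative) contrasts the two:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Plate ID': ['ABC123', np.nan, np.nan],
    'Registration State': ['NY', 'NJ', np.nan],
    'Street Name': ['43 ST', 'UNION ST', np.nan],
})
cols = ['Plate ID', 'Registration State', 'Street Name']

# Default how='any': drop a row if any of the three columns is NaN.
kept_any = toy.dropna(subset=cols)

# how='all': drop a row only if all three columns are NaN.
kept_all = toy.dropna(subset=cols, how='all')

print(len(kept_any.index))  # 1
print(len(kept_all.index))  # 2
```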

### Beyond the exercise

So far, you have specified which columns must be all non-null. But sometimes it’s OK for some columns to have null values, as long as it’s not too many. How many rows would you eliminate if you required at least three non-null values from the four columns Plate ID, Registration State, Vehicle Make, and Street Name?

In [61]:
df_with_atleast_3NotNull_values = df.dropna(
    subset=['Plate ID', 'Registration State', 'Vehicle Make', 'Street Name'],
    thresh=3
)
df_with_atleast_3NotNull_values

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,0523P,43 ST,BK
1,KRE6058,PA,ME/BE,0428P,UNION ST,BLK
2,444326R,NJ,LEXUS,0625A,CLERMONT AVENUE,BLACK
3,F728330,OH,CHEVR,1106A,DIVISION AVE,
4,FMY9090,NY,JEEP,1253A,GRAND ST,GREY
...,...,...,...,...,...,...
12495729,62161MM,NY,FORD,1111A,3RD AVE,BR
12495730,GYE7330,NY,HONDA,0444P,PELHAM PARK DR,BLK
12495731,HNY4802,NY,FORD,0210A,LYDIG AVE,GY
12495732,T687081C,NY,TOYOT,0225P,E 68 STREET,BLK


In [62]:
len(df.index) - len(df_with_atleast_3NotNull_values.index)

253
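
The `thresh` parameter sets the minimum number of non-null values a row needs, counted only across the `subset` columns, in order to be kept. The following sketch on a hypothetical toy frame makes the counting explicit; the names `toy`, `cols`, `non_null_counts`, and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Plate ID': ['ABC123', 'XYZ789', np.nan],
    'Registration State': ['NY', 'NJ', 'PA'],
    'Vehicle Make': ['HONDA', np.nan, np.nan],
    'Street Name': ['43 ST', 'UNION ST', 'GRAND ST'],
})
cols = ['Plate ID', 'Registration State', 'Vehicle Make', 'Street Name']

# Number of non-null values per row across the four columns.
non_null_counts = toy[cols].notnull().sum(axis='columns')

# Keep rows that have at least three non-null values in those columns.
kept = toy.dropna(subset=cols, thresh=3)

print(non_null_counts.tolist())  # [4, 3, 2]
print(kept.index.tolist())       # [0, 1]
```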

Which of the columns you’ve imported has the greatest number of NaN values? Is this a problem?

In [None]:
df.isnull()

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,True
4,False,False,False,False,False,False
...,...,...,...,...,...,...
12495729,False,False,False,False,False,False
12495730,False,False,False,False,False,False
12495731,False,False,False,False,False,False
12495732,False,False,False,False,False,False


In [65]:
df.isnull().sum()

Plate ID                 202
Registration State         0
Vehicle Make           62420
Violation Time           278
Street Name             1417
Vehicle Color         391982
dtype: int64
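
The column with the most NaN values can also be picked out programmatically with `idxmax`. The sketch below rebuilds the counts from the output above as a hypothetical Series named `null_counts`, rather than reloading the data; on the real data frame the same idea would be `df.isnull().sum().idxmax()`.

```python
import pandas as pd

# Null counts copied from the df.isnull().sum() output above.
null_counts = pd.Series({
    'Plate ID': 202,
    'Registration State': 0,
    'Vehicle Make': 62420,
    'Violation Time': 278,
    'Street Name': 1417,
    'Vehicle Color': 391982,
})
total_rows = 12_495_734  # len(df.index), from earlier output

worst_column = null_counts.idxmax()
worst_share = null_counts.max() / total_rows

print(worst_column)
print(f'{worst_share:.1%} of rows lack a value in that column')
```

Whether this is a problem depends on whether a ticket needs the color to stand. The earlier exercises treat Vehicle Color as optional, since it is in none of the required subsets.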

Null data is bad, but there is plenty of bad non-null data, too. For example, many cars with BLANKPLATE as a plate ID were ticketed. Turn these into NaN values, and rerun the previous query.

In [66]:
(df['Plate ID'] == 'BLANKPLATE').sum()

np.int64(8882)

In [67]:
# def set_nan(x):
#     if x == 'BLANKPLATE':
#         return np.nan
#     else:
#         return x

In [68]:
# df['Plate ID'] = df['Plate ID'].apply(set_nan) 

In [69]:
df

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,0523P,43 ST,BK
1,KRE6058,PA,ME/BE,0428P,UNION ST,BLK
2,444326R,NJ,LEXUS,0625A,CLERMONT AVENUE,BLACK
3,F728330,OH,CHEVR,1106A,DIVISION AVE,
4,FMY9090,NY,JEEP,1253A,GRAND ST,GREY
...,...,...,...,...,...,...
12495729,62161MM,NY,FORD,1111A,3RD AVE,BR
12495730,GYE7330,NY,HONDA,0444P,PELHAM PARK DR,BLK
12495731,HNY4802,NY,FORD,0210A,LYDIG AVE,GY
12495732,T687081C,NY,TOYOT,0225P,E 68 STREET,BLK


In [70]:
df['Plate ID'] = df['Plate ID'].replace('BLANKPLATE', np.nan)

In [71]:
(df['Plate ID'] == 'BLANKPLATE').sum()

np.int64(0)
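
As an alternative to replacing the marker after loading, `read_csv` accepts an `na_values` argument that can map a column name to strings that should be parsed as NaN. The sketch below compares both routes on a tiny in-memory CSV; the sample text and the names `csv_text`, `loaded`, and `parsed` are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

csv_text = """Plate ID,Registration State
J58JKX,NJ
BLANKPLATE,NY
KRE6058,PA
"""

# Option 1: replace the marker after loading, as in the cell above.
loaded = pd.read_csv(io.StringIO(csv_text))
loaded['Plate ID'] = loaded['Plate ID'].replace('BLANKPLATE', np.nan)

# Option 2: treat the marker as missing while parsing, for this column only.
parsed = pd.read_csv(io.StringIO(csv_text),
                     na_values={'Plate ID': ['BLANKPLATE']})

print(loaded['Plate ID'].isnull().sum())
print(parsed['Plate ID'].isnull().sum())
```

Either way, the comparison `df['Plate ID'] == 'BLANKPLATE'` then finds no matches, because a NaN never compares equal to a string.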

In [72]:
df_with_atleast_3NotNull_values = df.dropna(
    subset=['Plate ID', 'Registration State', 'Vehicle Make', 'Street Name'],
    thresh=3
)
df_with_atleast_3NotNull_values

Unnamed: 0,Plate ID,Registration State,Vehicle Make,Violation Time,Street Name,Vehicle Color
0,J58JKX,NJ,HONDA,0523P,43 ST,BK
1,KRE6058,PA,ME/BE,0428P,UNION ST,BLK
2,444326R,NJ,LEXUS,0625A,CLERMONT AVENUE,BLACK
3,F728330,OH,CHEVR,1106A,DIVISION AVE,
4,FMY9090,NY,JEEP,1253A,GRAND ST,GREY
...,...,...,...,...,...,...
12495729,62161MM,NY,FORD,1111A,3RD AVE,BR
12495730,GYE7330,NY,HONDA,0444P,PELHAM PARK DR,BLK
12495731,HNY4802,NY,FORD,0210A,LYDIG AVE,GY
12495732,T687081C,NY,TOYOT,0225P,E 68 STREET,BLK


In [73]:
len(df.index) - len(df_with_atleast_3NotNull_values.index)

944