# New York City: Parking Tickets Analysis
![New York City](https://blog-www.pods.com/wp-content/uploads/2019/04/MG_1_1_New_York_City-1.jpg)

In this notebook, we shall analyse the data that the NYC Department of Finance collects on every **parking ticket** issued in the city, to deliver insights to the governing body (let's just imagine one 😋)

You can find more details about the dataset [**here**](https://www.kaggle.com/new-york-city/nyc-parking-tickets)

The notebook consists of four parts - **Environment Setup**, **Reading Data**, **Cleaning** & **Analysis**

We shall be using the first **5 Million** rows from the original dataset for our analysis

*So let's get started!*

## I. Environment Setup

In [None]:
# Import Python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

# Silence the warning 'SettingWithCopyWarning'
pd.options.mode.chained_assignment = None

## II. Reading Data

In [None]:
# Create Pandas dataframe with only 5 Million rows
nyc_parking_tickets = pd.read_csv("../input/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2017.csv",nrows=5000000)

In [None]:
# View top 5 rows
nyc_parking_tickets.head(5)

In [None]:
# View shape of dataframe (rows,columns)
nyc_parking_tickets.shape

## III. Cleaning

In [None]:
# Change column names
nyc_parking_tickets.columns = ['SummonsNumber', 'PlateID', 'RegistrationState', 'PlateType', 'IssueDate', 'ViolationCode', 'VehicleBodyType', 'VehicleMake', 'IssuingAgency', 'StreetCode1', 'StreetCode2', 'StreetCode3','VehicleExpirationDate', 'ViolationLocation', 'ViolationPrecinct','IssuerPrecinct', 'IssuerCode', 'IssuerCommand', 'IssuerSquad', 'ViolationTime', 'TimeFirstObserved', 'ViolationCounty', 'ViolationInFrontOfOrOpposite', 'HouseNumber', 'StreetName', 'IntersectingStreet', 'DateFirstObserved', 'LawSection', 'SubDivision', 'ViolationLegalCode', 'DaysParkingInEffect', 'FromHoursInEffect', 'ToHoursInEffect', 'VehicleColor', 'UnregisteredVehicle', 'VehicleYear', 'MeterNumber', 'FeetFromCurb', 'ViolationPostCode', 'ViolationDescription', 'NoStandingOrStoppingViolation', 'HydrantViolation', 'DoubleParkingViolation']

# Drop columns with > 80% null values
columns_to_drop = nyc_parking_tickets.isna().mean() * 100 > 80
nyc_parking_tickets.drop(columns=columns_to_drop[columns_to_drop].index, inplace=True)

# Convert 'IssueDate' to datetime
nyc_parking_tickets['IssueDate'] = pd.to_datetime(nyc_parking_tickets['IssueDate'],format='%m/%d/%Y',errors='coerce')

# Replace '99' in RegistrationState column by null
nyc_parking_tickets['RegistrationState'] = nyc_parking_tickets['RegistrationState'].replace({'99': None})

# Replace '999' in PlateType column by null
nyc_parking_tickets['PlateType'] = nyc_parking_tickets['PlateType'].replace({'999': None})

# Convert incorrect values (88888888, etc.) in 'VehicleExpirationDate' column to null
# Series.mask nulls only this column; assigning None to the boolean-indexed dataframe would blank entire rows
incorrect_values = nyc_parking_tickets['VehicleExpirationDate'] > 20990101
nyc_parking_tickets['VehicleExpirationDate'] = nyc_parking_tickets['VehicleExpirationDate'].mask(incorrect_values)

# Convert 'ViolationTime' column to string and set values not containing 'A' or 'P' (ambiguous) to null
nyc_parking_tickets['ViolationTime'] = nyc_parking_tickets['ViolationTime'].astype('str')
ViolationTime_ambiguous = ~nyc_parking_tickets['ViolationTime'].str.contains('P|A')
# Null only the 'ViolationTime' column, not the whole row
nyc_parking_tickets.loc[ViolationTime_ambiguous, 'ViolationTime'] = None

# Fix 'DateFirstObserved' column
    # Replace NaN values with 0
    # Replace 0 with null
    # Convert the column to datetime
nyc_parking_tickets['DateFirstObserved'] = nyc_parking_tickets['DateFirstObserved'].replace({np.nan:0}).astype('int')
nyc_parking_tickets['DateFirstObserved'] = nyc_parking_tickets['DateFirstObserved'].replace({0:None})
nyc_parking_tickets['DateFirstObserved'] = pd.to_datetime(nyc_parking_tickets['DateFirstObserved'],format='%Y%m%d',errors='coerce')

# Drop duplicates based on 'SummonsNumber' column as summons numbers should be unique
nyc_parking_tickets.drop_duplicates(subset = ['SummonsNumber'], inplace = True)
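
When nulling out invalid values during cleaning, it matters whether the assignment targets one column or whole rows. The sketch below contrasts the two on a toy dataframe. The toy values and the names `toy`, `rows_nulled` and `column_nulled` are assumptions made for illustration only, not part of the real dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "PlateID": ["ABC123", "XYZ789", "JKL456"],
    "VehicleExpirationDate": [20170630.0, 88888888.0, 20180101.0],
})

# Boolean flag for implausible expiration dates
invalid = toy["VehicleExpirationDate"] > 20990101

# Row-level assignment: every column of the flagged rows becomes null
rows_nulled = toy.copy()
rows_nulled[invalid] = None

# Column-level masking: only 'VehicleExpirationDate' is nulled
column_nulled = toy.copy()
column_nulled["VehicleExpirationDate"] = column_nulled["VehicleExpirationDate"].mask(invalid)

print(rows_nulled.isna().sum())    # nulls per column after row-level assignment
print(column_nulled.isna().sum())  # nulls per column after column-level masking
```

With row-level assignment the plate ID of the flagged ticket is lost as well, so that ticket no longer contributes to any later count. Column-level masking keeps the rest of the row available for analysis.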

## IV. Analysis

In [None]:
# Issuing agency vs Summons dataframe
ia_plot = nyc_parking_tickets.groupby(['IssuingAgency']).count()['SummonsNumber'].sort_values(ascending=False).reset_index()
# Bar plot
sns.barplot(x = 'IssuingAgency', y = 'SummonsNumber', data = ia_plot)

The agency with code **T** issued the maximum number of summons
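
The group-count-sort pattern used here (and repeated through the rest of the analysis) can be sketched on a toy dataframe. The toy values below are made up for illustration. Selecting the `SummonsNumber` column before calling `count()` is a small variation on the notebook's code that avoids counting every other column.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "SummonsNumber": [101, 102, 103, 104, 105],
    "IssuingAgency": ["T", "P", "T", "T", "P"],
})

# Count summons per agency, largest first
counts = (
    toy.groupby("IssuingAgency")["SummonsNumber"]
       .count()
       .sort_values(ascending=False)
       .reset_index()
)
print(counts)

# value_counts() gives the same ranking in one call
# when every row has a SummonsNumber
ranking = toy["IssuingAgency"].value_counts()
print(ranking)
```

Note that `count()` skips rows where `SummonsNumber` is null, while `value_counts()` counts every row with a non-null agency, so the two agree only when no summons numbers are missing.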

In [None]:
# Vehicle registration state (Top 10) vs Summons
nyc_parking_tickets.groupby(['RegistrationState']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(10)

The vehicle registration state with the highest number of summons was **New York**

In [None]:
# Plate Type (Top 10) vs Summons
nyc_parking_tickets.groupby(['PlateType']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(10)

The vehicle plate type with the highest number of summons was **PAS**

In [None]:
# Plate ID (Top 5) vs Summons
nyc_parking_tickets.groupby(['PlateID']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(5)

Amongst the different plate IDs, the maximum number of tickets was issued to vehicles with **no plate ID (blank plates)**

In [None]:
# Issue date time series plot
issue_date = nyc_parking_tickets.loc[:,['IssueDate','SummonsNumber']].groupby('IssueDate').count()['SummonsNumber'].reset_index()
sns.relplot(x = 'IssueDate', y = 'SummonsNumber', data = issue_date, kind = "line")
plt.xticks(rotation = 45)

From the above time series plot, we can observe that ticket issuance peaked between **September and November 2016** (the data spans July 2016 to July 2017)
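
A daily line plot can be noisy, so one way to double-check a seasonal peak is to aggregate the counts by month. The sketch below does this on a toy dataframe. The dates are invented for illustration, and `monthly` and `busiest_month` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical dates)
toy = pd.DataFrame({
    "SummonsNumber": [1, 2, 3, 4, 5, 6],
    "IssueDate": pd.to_datetime([
        "2016-09-01", "2016-09-15",
        "2016-10-03", "2016-10-20", "2016-10-21",
        "2017-01-05",
    ]),
})

# Group by calendar month and count tickets in each month
monthly = toy.groupby(toy["IssueDate"].dt.to_period("M"))["SummonsNumber"].count()
print(monthly)

# Month with the highest ticket count
busiest_month = monthly.idxmax()
print(busiest_month)
```

Rows whose `IssueDate` could not be parsed (`NaT`) are left out of the groups, since `groupby` drops null keys by default.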

In [None]:
# Top 10 Violations dataframe
top_10_violations = nyc_parking_tickets.loc[:,['ViolationDescription','SummonsNumber']].groupby(['ViolationDescription']).count()['SummonsNumber'].reset_index().sort_values('SummonsNumber',ascending = False).head(10)
# Bar plot
sns.barplot(x = 'SummonsNumber', y = 'ViolationDescription', data = top_10_violations)

The maximum number of tickets was issued for the **overspeeding near school premises** violation

In [None]:
# Violation Time (Top 10) vs Summons
nyc_parking_tickets.groupby(['ViolationTime']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(10)

From the above table, it is quite evident that the maximum number of violations occurred **between 8 AM and 12 PM**

In [None]:
# Violation County vs Summons
vc_plot = nyc_parking_tickets.groupby(['ViolationCounty']).count()['SummonsNumber'].sort_values(ascending=False).reset_index()
# Bar plot
sns.barplot(x = 'ViolationCounty', y = 'SummonsNumber', data = vc_plot)

The maximum number of violations occurred in the county of **New York**

In [None]:
# Vehicle Body Type (Top 10) vs Summons
nyc_parking_tickets.groupby(['VehicleBodyType']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(10)

The **SUBN** vehicle body type accounted for the maximum number of violations

In [None]:
# Vehicle Make (Top 10) vs Summons
nyc_parking_tickets.groupby(['VehicleMake']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(10)

The highest number of tickets were issued to **Toyota** vehicles

In [None]:
# Vehicle Make + Body Type (Top 5) vs Summons
nyc_parking_tickets.groupby(['VehicleMake','VehicleBodyType']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(5)

**Toyota** vehicles of body type **4DSD** accounted for the maximum number of summons

In [None]:
# Violation Location (Top 10) vs Summons
nyc_parking_tickets.groupby(['ViolationLocation']).count()['SummonsNumber'].sort_values(ascending=False).reset_index().head(10)

The general violation location number **19** recorded the maximum number of tickets

In [None]:
# Registration State, Violation Description & Summons Number
nyc_parking_tickets.loc[:,['RegistrationState','ViolationDescription','SummonsNumber']].groupby(['RegistrationState','ViolationDescription']).count()['SummonsNumber'].reset_index().sort_values('SummonsNumber',ascending = False).head(5)

The maximum number of tickets was issued for the **overspeeding near school premises** violation to vehicles registered in NY

In [None]:
# Time series plot (ViolationTime vs Summons)

# Concatenate 'M' to 'ViolationTime' column 
nyc_parking_tickets['ViolationTime'] = nyc_parking_tickets['ViolationTime'] + 'M'
# Convert the column to datetime
nyc_parking_tickets['ViolationTime'] = pd.to_datetime(nyc_parking_tickets['ViolationTime'],format='%I%M%p',errors='coerce')
# Extract hour from the 'ViolationTime' column 
nyc_parking_tickets['ViolationTime'] = nyc_parking_tickets['ViolationTime'].dt.strftime('%H')
# Prepare dataframe for time series plot
time_series_summons = nyc_parking_tickets.groupby(['ViolationTime']).count()['SummonsNumber'].reset_index().sort_values(['ViolationTime'])

# Time series plot using seaborn
sns.relplot(x = "ViolationTime", y = "SummonsNumber", data = time_series_summons, kind = "line", ci = None)
plt.xlabel("Hour of day (24 hours format)")
plt.xticks(rotation = 90)
plt.ylabel("Summons Count")
plt.show()

From the above line plot, we can see that the number of summons issued in a day peaked around **9 AM**
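
To see how the `'%I%M%p'` format turns raw values such as `0830A` into an hour of the day, here is a minimal sketch using the standard library's `datetime.strptime`, which uses the same format codes as the `pd.to_datetime` call above. `violation_hour` is a hypothetical helper written for this illustration, and the sample strings are made up.

```python
from datetime import datetime

def violation_hour(raw):
    """Return the 24-hour clock hour for a string like '0830A', or None if unparseable."""
    try:
        # Append 'M' so that 'A'/'P' becomes 'AM'/'PM', then parse as 12-hour time
        return datetime.strptime(raw + "M", "%I%M%p").hour
    except ValueError:
        return None

print(violation_hour("0830A"))  # 8
print(violation_hour("0145P"))  # 13
print(violation_hour("1200P"))  # 12
print(violation_hour("0015A"))  # None
```

The last case is worth noting: `%I` only accepts hours from 01 to 12, so a value written as `0015A` does not parse. With `errors='coerce'` such values become `NaT`, so they are not counted under any hour.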

![](https://i.ytimg.com/vi/0QI4eG8D0Ic/maxresdefault.jpg)

This brings us to the end of the notebook

In this Kaggle notebook, we read 5 Million rows from the NYC Parking dataset, performed data wrangling, and then analysed the data to deliver insights

Since I'm a beginner, I would love to have your valuable feedback and suggestions so that I can keep on improving

Also, if you liked my work, please consider upvoting this notebook. It would mean a lot to me!

Thank you😄