# Dataset / Topic Name: "Traffic Violations in the USA"


## Title and Author

**Author: Aryasri Meghna Vaishnavi**

**Link to the author's GitHub profile: https://github.com/Meghnaaryasri/UMBC-DATA606-Capstone**

**Link to the author's LinkedIn profile: https://www.linkedin.com/in/meghna-aryasri/**

**Link to the PowerPoint presentation file: https://umbc-my.sharepoint.com/:p:/g/personal/meghnaa1_umbc_edu/EeDqhAiATlpIgmAWf0OJi4UBp_VpSSM3h9FjRR13bp38eg?e=eogbe5**

**Link to the YouTube video: https://youtu.be/DKaPGXUmXtw**

## Background


### What is it about?

**This project focuses on analyzing traffic violations in the United States, tracing back to the first recorded traffic ticket in 1899. It aims to explore the evolution, types, and impact of traffic violations on society and state revenue.**



### Why does it matter?

**Understanding traffic violations is crucial for enhancing road safety, shaping public policy, and assessing the financial implications for drivers and state budgets. Analyzing these data can reveal patterns and trends that may inform more effective traffic laws and enforcement strategies.**

## Research Questions

**Trends Analysis: What are the most common times of day or periods of the year when traffic violations peak, and how can this information be utilized to enhance predictive accuracy?**

**Influential Factors: Which factors most significantly influence the outcomes of traffic violations, such as the likelihood of accidents or repeat offenses, and how can these insights improve the predictive modeling process?**

**Intervention Strategies: What specific, data-driven interventions can be recommended to prevent severe outcomes from traffic violations, and how can these strategies be effectively implemented within the target region?**

### Columns

| Column Name | Description |
| --- | --- |
| **Description** | Specifies the traffic violation type |
| **Location** | Location of the incident |
| **Make** | Vehicle make involved |
| **Driver State** | State issuing the driver's license |
| **Time Of Stop** | Time when the violation was noted |
| **Gender** | Gender of the driver |
| **Violation Type** | Classification of the violation (e.g., Citation) |

### Target Variable
**Violation Type**: **This column is used as the target variable for predictions. It categorizes the type of violation: 'Citation', 'Warning', 'ESERO', or 'SERO'.**

### Dimensions
**Number of Rows**: **132,728**

**Number of Columns**: **7**

## Project Procedure


### 1. Data Cleaning and Preparation:


**Import necessary libraries**

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from imblearn.over_sampling import SMOTE
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


**Step 1 - Importing Libraries and Loading Data**

In [3]:
import pandas as pd

data = pd.read_csv('Trafficviolations.csv')
data.head()

Unnamed: 0,Description,Location,Make,Driver State,Time Of Stop,Gender,Violation Type
0,DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI...,8804 FLOWER AVE,FORD,MD,17:11:00,M,Citation
1,DRIVING WHILE IMPAIRED BY ALCOHOL,NORFOLK AVE / ST ELMO AVE,AUDI,MD,00:41:00,M,Citation
2,FAILURE TO STOP AT STOP SIGN,WISTERIA DR @ WARING STATION RD,TOYT,MD,23:12:00,F,Citation
3,DRIVER USING HANDS TO USE HANDHELD TELEPHONE W...,CLARENDON RD @ ELM ST. N/,HONDA,VA,16:10:00,M,Citation
4,FAILURE STOP AND YIELD AT THRU HWY,CHRISTOPHER AVE/MONTGOMERY VILLAGE AVE,HONDA,MD,12:52:00,F,Citation


**Step 2 - Displaying Last Entries of the Dataset**

In [4]:
data.tail()


Unnamed: 0,Description,Location,Make,Driver State,Time Of Stop,Gender,Violation Type
742154,OPERATING MOTOR VEHICLE IN CONDITION LIKELY TO...,NB 270 X16,GMC,VA,14:20:00,M,Warning
742155,OPERATING A M/V W/O PROPER REQUIRED EMERGENCY ...,NB 270 X16,GMC,VA,14:20:00,M,Warning
742156,DRIVER FAILING TO PREVENT AGAINST LOSS OF LOAD,NB 270 X16,GMC,VA,14:20:00,M,Warning
742157,"HEADLIGHT, TAILLIGHT, STOPLIGHT, TURN SIGNAL, ...",NB 270 X16,GMC,VA,14:20:00,M,Warning
742158,PARTS/ACCESSORIES NOT SPECIFICALLY PROVIDED FO...,NB 270 X16,GMC,VA,14:20:00,M,Warning


**Step 3 - General Information about the Dataset**

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 742159 entries, 0 to 742158
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Description     742152 non-null  object
 1   Location        742158 non-null  object
 2   Make            742119 non-null  object
 3   Driver State    742149 non-null  object
 4   Time Of Stop    742159 non-null  object
 5   Gender          742159 non-null  object
 6   Violation Type  742159 non-null  object
dtypes: object(7)
memory usage: 39.6+ MB


In [6]:
data.corr

**Note: `data.corr` is written without parentheses, so this cell only echoes the bound method (whose repr is a dump of the DataFrame itself) and does not compute a correlation matrix. All seven columns are of dtype `object` at this point, so there are no numeric columns to correlate.**


In [7]:
data.describe()

Unnamed: 0,Description,Location,Make,Driver State,Time Of Stop,Gender,Violation Type
count,742152,742158,742119,742149,742159,742159,742159
unique,9131,121997,2660,66,1440,3,4
top,DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC...,IS 370 @ IS 270,TOYOTA,MD,23:20:00,M,Citation
freq,56364,1935,79110,670475,993,503183,495956


In [8]:
data.shape


(742159, 7)

In [9]:
data.columns


Index(['Description', 'Location', 'Make', 'Driver State', 'Time Of Stop',
       'Gender', 'Violation Type'],
      dtype='object')

**Step 4 - Checking for Missing Values**

In [10]:
data.isnull().sum()


Description        7
Location           1
Make              40
Driver State      10
Time Of Stop       0
Gender             0
Violation Type     0
dtype: int64

In [11]:
data.dropna(axis=1, how='all', inplace=True)
data.dropna(inplace=True)

In [12]:
data.isnull().sum()


Description       0
Location          0
Make              0
Driver State      0
Time Of Stop      0
Gender            0
Violation Type    0
dtype: int64
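
The missing-value check and row drop above can be summarised in one place. The sketch below is a minimal illustration on a small, made-up frame (`toy` and its values are assumptions, not rows from `Trafficviolations.csv`); it reports missing values per column as counts and percentages, then records how many rows `dropna` removes. In the notebook itself the row count goes from 742,159 to 742,101, so very little data is lost.

```python
import pandas as pd

# Hypothetical toy frame standing in for the traffic data (not the real file).
toy = pd.DataFrame({
    "Description": ["SPEEDING", None, "WINDOW TINT", "STOP LIGHTS (*)"],
    "Make": ["FORD", "AUDI", None, "HONDA"],
    "Violation Type": ["Citation", "Warning", "ESERO", "Citation"],
})

# Per-column count and percentage of missing values.
missing = toy.isnull().sum()
missing_pct = (missing / len(toy) * 100).round(2)
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)

# Drop incomplete rows without mutating the original, then report the loss.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

Returning a new frame instead of using `inplace=True` keeps the original available for this kind of before/after comparison.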

In [13]:
data.dtypes


Description       object
Location          object
Make              object
Driver State      object
Time Of Stop      object
Gender            object
Violation Type    object
dtype: object

In [14]:
duplicate_rows = data.duplicated()
duplicate_rows

0         False
1         False
2         False
3         False
4         False
          ...  
742154    False
742155    False
742156    False
742157    False
742158    False
Length: 742101, dtype: bool

In [15]:
duplicate_columns = data.columns.duplicated()
duplicate_columns

array([False, False, False, False, False, False, False])
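
The boolean Series printed by `data.duplicated()` is hard to read at this size; summing it gives a single count. The following is a small sketch on made-up rows (the `toy` frame is an assumption). Whether duplicates should actually be dropped is a judgment call, since identical rows in this kind of data may be genuine repeated charges from the same stop.

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates of each other.
toy = pd.DataFrame({
    "Location": ["NB 270 X16", "NB 270 X16", "8804 FLOWER AVE"],
    "Make": ["GMC", "GMC", "FORD"],
    "Violation Type": ["Warning", "Warning", "Citation"],
})

# duplicated() marks every repeat after the first occurrence as True.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())
print("Duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates()
print("Rows after de-duplication:", len(deduped))
```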

**Step 5 - Data Type Conversion**

In [16]:
data['Time Of Stop'] = pd.to_datetime('1900-01-01 ' + data['Time Of Stop'].astype(str))

# Convert categorical data to 'category' dtype to save memory and improve performance
data['Description'] = data['Description'].astype('category')
data['Location'] = data['Location'].astype('category')
data['Make'] = data['Make'].astype('category')
data['Driver State'] = data['Driver State'].astype('category')
data['Gender'] = data['Gender'].astype('category')
data['Violation Type'] = data['Violation Type'].astype('category')


**Convert time and categorical columns to appropriate data types to save memory and improve performance**

In [17]:
data.dtypes


Description             category
Location                category
Make                    category
Driver State            category
Time Of Stop      datetime64[ns]
Gender                  category
Violation Type          category
dtype: object
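
To see why the `category` conversion saves memory, one can compare `memory_usage(deep=True)` before and after. This is a sketch on a synthetic column (the `states` Series is an assumption chosen to mimic a low-cardinality field such as `Driver State`), not a measurement of the real dataset.

```python
import pandas as pd

# Synthetic low-cardinality column: 100,000 values, only 3 distinct strings.
states = pd.Series(["MD", "VA", "DC", "MD"] * 25_000, dtype=object, name="Driver State")

# deep=True counts the Python string objects, not just the pointers to them.
bytes_object = states.memory_usage(deep=True)
bytes_category = states.astype("category").memory_usage(deep=True)

print(f"object:   {bytes_object:,} bytes")
print(f"category: {bytes_category:,} bytes")
```

The saving comes from storing one small integer code per row plus a single copy of each distinct string. The benefit shrinks for columns with many unique values, such as `Location`.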

**Step 6 - Analysis of 'Violation Type'**

In [18]:
data['Violation Type'].value_counts(normalize=True)


Citation    0.668282
Warning     0.320248
ESERO       0.011439
SERO        0.000031
Name: Violation Type, dtype: float64

In [19]:
data['Violation Type'].nunique()


4

In [20]:
data['Violation Type'].value_counts()


Citation    495933
Warning     237656
ESERO         8489
SERO            23
Name: Violation Type, dtype: int64
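
The counts above show a strongly imbalanced target, which matters for any classifier trained on it. A minimal sketch of how the imbalance could be quantified follows, using made-up labels (the `labels` Series is an assumption). The weights use the common "balanced" heuristic, `n_samples / (n_classes * class_count)`; this is one option among several and is not necessarily what this project used.

```python
import pandas as pd

# Hypothetical labels with a skewed distribution.
labels = pd.Series(
    ["Citation"] * 6 + ["Warning"] * 3 + ["ESERO"] * 1,
    name="Violation Type",
)

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest class.
imbalance_ratio = counts.max() / counts.min()

# "Balanced" class weights: rarer classes receive larger weights.
class_weights = (len(labels) / (counts.size * counts)).to_dict()

print(shares)
print("Imbalance ratio:", imbalance_ratio)
print("Class weights:", class_weights)
```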

**Step 7 - Aggregating Description Based on Violation Type**

In [21]:
data.groupby('Violation Type')['Description'].value_counts()



Violation Type                                                                                                   
Citation        DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS                           27042
                PERSON DRIVING MOTOR VEHICLE ON HIGHWAY OR PUBLIC USE PROPERTY ON SUSPENDED LICENSE AND PRIVILEGE    20968
                DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION                                               18629
                FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND            18076
                FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER                                   17491
                                                                                                                     ...  
                DRIVING VEHICLE IN EXCESS OF REASONABLE AND PRUDENT SPEED ON HIGHWAY 50/40-ICY                           0
                DRIVING V

In [22]:
data.groupby('Violation Type')['Description'].value_counts().groupby(level=0).head(3)


Violation Type                                                                                                   
Citation        DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS                           27042
                PERSON DRIVING MOTOR VEHICLE ON HIGHWAY OR PUBLIC USE PROPERTY ON SUSPENDED LICENSE AND PRIVILEGE    20968
                DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION                                               18629
ESERO           STOP LIGHTS (*)                                                                                       3189
                HEADLIGHTS (*)                                                                                        1317
                WINDOW TINT                                                                                            979
SERO            DRIVING A MOTOR VEH WITHOUT A VALID MEDICAL EXAMINERS CERTIFICATE IN POSSESSION                          6
                HEADLIGHT

**Show the top three descriptions per violation type for detailed insights**
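
Because both columns are categorical, grouping can emit a row for every possible combination, including combinations that never occur (the zero counts visible in the Step 7 output). The sketch below, on made-up data (the `toy` frame is an assumption), shows how `observed=True` restricts the result to combinations that actually appear before taking the top three per violation type.

```python
import pandas as pd

# Hypothetical data; both columns are stored as category dtype.
toy = pd.DataFrame({
    "Violation Type": ["Citation", "Citation", "Citation", "ESERO", "ESERO", "Warning"],
    "Description": ["SPEEDING", "SPEEDING", "STOP SIGN", "WINDOW TINT", "WINDOW TINT", "SPEEDING"],
}).astype("category")

# observed=False yields every category combination, zero counts included.
all_pairs = toy.groupby(["Violation Type", "Description"], observed=False).size()

# observed=True keeps only the combinations present in the data.
pair_counts = (
    toy.groupby(["Violation Type", "Description"], observed=True)
    .size()
    .reset_index(name="count")
)

# Sort by count, then keep the first three rows of each violation type.
top_n = (
    pair_counts.sort_values("count", ascending=False)
    .groupby("Violation Type", observed=True)
    .head(3)
)
print(top_n)
```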

In [23]:
numeric_columns = data.select_dtypes(include=['number']).columns
print("Numeric columns:", numeric_columns)

Numeric columns: Index([], dtype='object')


In [24]:
categorical_columns = data.select_dtypes(include=['category']).columns
print("Categorical columns:", categorical_columns)

Categorical columns: Index(['Description', 'Location', 'Make', 'Driver State', 'Gender',
       'Violation Type'],
      dtype='object')


**Step 8 - Determine Time of Day from Time of Stop**

In [25]:
# Function to determine if the time is day or night
def time_of_day(time):
    if 6 <= time.hour < 18:
        return 'Day'
    else:
        return 'Night'

# Apply this function to the 'Time Of Stop' column
data['Time of Day'] = data['Time Of Stop'].apply(time_of_day)


**Categorize traffic stops into day or night based on the time they occurred**
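
On several hundred thousand rows, a row-by-row `apply` can be slow. A vectorized variant is sketched below, assuming the same 06:00 to 18:00 boundary as the `time_of_day` function; the sample timestamps in `stops` are illustrative and the names are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative timestamps using the same placeholder date as the notebook.
stops = pd.Series(pd.to_datetime([
    "1900-01-01 17:11:00",
    "1900-01-01 00:41:00",
    "1900-01-01 23:12:00",
    "1900-01-01 06:00:00",
    "1900-01-01 18:00:00",
]))

# Extract the hour once for the whole column, then label with a boolean mask.
hours = stops.dt.hour
is_day = (hours >= 6) & (hours < 18)
time_of_day = pd.Series(
    np.where(is_day, "Day", "Night"), index=stops.index, dtype="category"
)
print(time_of_day.tolist())
```

The boundary hours behave the same as in the function above: 06:00 counts as day and 18:00 counts as night.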

In [26]:
data.head()


Unnamed: 0,Description,Location,Make,Driver State,Time Of Stop,Gender,Violation Type,Time of Day
0,DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI...,8804 FLOWER AVE,FORD,MD,1900-01-01 17:11:00,M,Citation,Day
1,DRIVING WHILE IMPAIRED BY ALCOHOL,NORFOLK AVE / ST ELMO AVE,AUDI,MD,1900-01-01 00:41:00,M,Citation,Night
2,FAILURE TO STOP AT STOP SIGN,WISTERIA DR @ WARING STATION RD,TOYT,MD,1900-01-01 23:12:00,F,Citation,Night
3,DRIVER USING HANDS TO USE HANDHELD TELEPHONE W...,CLARENDON RD @ ELM ST. N/,HONDA,VA,1900-01-01 16:10:00,M,Citation,Day
4,FAILURE STOP AND YIELD AT THRU HWY,CHRISTOPHER AVE/MONTGOMERY VILLAGE AVE,HONDA,MD,1900-01-01 12:52:00,F,Citation,Day


In [27]:
data.dtypes


Description             category
Location                category
Make                    category
Driver State            category
Time Of Stop      datetime64[ns]
Gender                  category
Violation Type          category
Time of Day               object
dtype: object

**Step 9 - Converting Data Types after Feature Extraction**

In [28]:
# Convert columns to relevant data types
data['Location'] = data['Location'].astype('category')
data['Make'] = data['Make'].astype('category')
data['Driver State'] = data['Driver State'].astype('category')
data['Gender'] = data['Gender'].astype('category')
data['Violation Type'] = data['Violation Type'].astype('category')
data['Time Of Stop'] = pd.to_datetime(data['Time Of Stop'])
# 'Time of Day' holds the labels 'Day'/'Night'; pd.to_datetime would coerce them all to NaT
data['Time of Day'] = data['Time of Day'].astype('category')

In [29]:
data.dtypes

Description             category
Location                category
Make                    category
Driver State            category
Time Of Stop      datetime64[ns]
Gender                  category
Violation Type          category
Time of Day             category
dtype: object
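
After any dtype conversion it is worth checking that the values survived, not only that the dtype changed. The sketch below uses a small made-up `labels` Series (an assumption). It shows that passing text labels through `pd.to_datetime(..., errors='coerce')` silently turns every value into `NaT`, whereas `astype('category')` preserves them.

```python
import pandas as pd

# Hypothetical label column shaped like 'Time of Day'.
labels = pd.Series(["Day", "Night", "Day"], name="Time of Day")

# Non-date text parsed with errors="coerce" becomes NaT without raising.
coerced = pd.to_datetime(labels, errors="coerce")
print("Values lost to NaT:", int(coerced.isna().sum()))

# Storing the labels as a category keeps them intact.
as_category = labels.astype("category")
print(as_category.cat.categories.tolist())
```

A quick `isna().sum()` after each conversion is usually enough to catch this kind of silent data loss.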