## Data Description:

**The dataset "Traffic_Violations_USA.csv" contains information on traffic violations in the United States, including details such as the date and time of the violation, location, type of violation, driver demographics, and enforcement-related information.**

**Data Source: Kaggle, "https://www.kaggle.com/datasets/felix4guti/traffic-violations-in-usa/data"**

## 1. Data Cleaning and Preparation:
**1.1 Import Data:**

In [1]:
import pandas as pd

# Load the dataset into a pandas DataFrame
file_path = 'Traffic_Violations_USA.csv'
traffic_violations_df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
traffic_violations_df.head()


Unnamed: 0,Date Of Stop,Time Of Stop,Agency,SubAgency,Description,Location,Latitude,Longitude,Accident,Belts,...,Charge,Article,Contributed To Accident,Race,Gender,Driver City,Driver State,DL State,Arrest Type,Geolocation
0,09/24/2013,17:11:00,MCP,"3rd district, Silver Spring",DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI...,8804 FLOWER AVE,,,No,No,...,13-401(h),Transportation Article,No,BLACK,M,TAKOMA PARK,MD,MD,A - Marked Patrol,
1,12/20/2012,00:41:00,MCP,"2nd district, Bethesda",DRIVING WHILE IMPAIRED BY ALCOHOL,NORFOLK AVE / ST ELMO AVE,38.983578,-77.093105,No,No,...,21-902(b1),Transportation Article,No,WHITE,M,DERWOOD,MD,MD,A - Marked Patrol,"(38.9835782, -77.09310515)"
2,07/20/2012,23:12:00,MCP,"5th district, Germantown",FAILURE TO STOP AT STOP SIGN,WISTERIA DR @ WARING STATION RD,39.16181,-77.253581,No,No,...,21-707(a),Transportation Article,No,ASIAN,F,GERMANTOWN,MD,MD,A - Marked Patrol,"(39.1618098166667, -77.25358095)"
3,03/19/2012,16:10:00,MCP,"2nd district, Bethesda",DRIVER USING HANDS TO USE HANDHELD TELEPHONE W...,CLARENDON RD @ ELM ST. N/,38.982731,-77.100755,No,No,...,21-1124.2(d2),Transportation Article,No,HISPANIC,M,ARLINGTON,VA,VA,A - Marked Patrol,"(38.9827307333333, -77.1007551666667)"
4,12/01/2014,12:52:00,MCP,"6th district, Gaithersburg / Montgomery Village",FAILURE STOP AND YIELD AT THRU HWY,CHRISTOPHER AVE/MONTGOMERY VILLAGE AVE,39.162888,-77.229088,No,No,...,21-403(b),Transportation Article,No,BLACK,F,UPPER MARLBORO,MD,MD,A - Marked Patrol,"(39.1628883333333, -77.2290883333333)"


**The above code cell imports the dataset "Traffic_Violations_USA.csv" into a pandas DataFrame for further analysis.**
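
**A minimal sketch of loading with explicit options, assuming a large CSV whose columns may contain mixed types. The two-row in-memory sample and the choice of pinning `Agency` to `str` are illustrative assumptions, not part of the original notebook.**

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for Traffic_Violations_USA.csv
sample_csv = io.StringIO(
    "Date Of Stop,Time Of Stop,Agency,Latitude,Longitude\n"
    "09/24/2013,17:11:00,MCP,,\n"
    "12/20/2012,00:41:00,MCP,38.983578,-77.093105\n"
)

# low_memory=False lets pandas infer each column's dtype from the whole file
# instead of chunk by chunk; dtype pins selected columns explicitly.
sample_df = pd.read_csv(sample_csv, low_memory=False, dtype={"Agency": str})

print(sample_df.shape)
print(sample_df.dtypes)
```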

**1.2 Initial Exploration:**

**The code cells below give an initial overview of the dataset, using df.info() and df.describe() to show its structure, column types, and basic statistics.**

In [2]:
# Display the info of the dataframe
# (info() prints its report directly and returns None, so it is not wrapped in print)
print("\nDataframe info:")
traffic_violations_df.info()

# Display basic statistical insights of the dataframe
print("\nBasic statistical insights:")
traffic_violations_df.describe()


Dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018634 entries, 0 to 1018633
Data columns (total 35 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   Date Of Stop             1018634 non-null  object 
 1   Time Of Stop             1018634 non-null  object 
 2   Agency                   1018634 non-null  object 
 3   SubAgency                1018634 non-null  object 
 4   Description              1018625 non-null  object 
 5   Location                 1018632 non-null  object 
 6   Latitude                 933599 non-null   float64
 7   Longitude                933599 non-null   float64
 8   Accident                 1018634 non-null  object 
 9   Belts                    1018634 non-null  object 
 10  Personal Injury          1018634 non-null  object 
 11  Property Damage          1018634 non-null  object 
 12  Fatal                    1018634 non-null  object 
 13  Commercial License       

Unnamed: 0,Latitude,Longitude,Year
count,933599.0,933599.0,1012208.0
mean,39.070965,-77.099552,2004.325
std,1.273985,1.139822,84.5761
min,-94.610988,-77.820825,0.0
25%,39.031208,-77.195098,2001.0
50%,39.074158,-77.093166,2005.0
75%,39.138796,-77.042386,2010.0
max,40.111822,41.54316,9999.0
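
**The summary above shows a Year minimum of 0 and a maximum of 9999, which look like placeholder entries rather than real model years. A minimal sketch of flagging such values follows; the 1900 to 2025 bounds are an assumption chosen for illustration, not taken from the dataset documentation.**

```python
import pandas as pd

# Toy values mirroring the Year extremes in the describe() output above
years = pd.Series([0.0, 2001.0, 2005.0, 2010.0, 9999.0], name="Year")

# Assumed plausible range for a vehicle model year (hypothetical bounds)
MIN_YEAR, MAX_YEAR = 1900, 2025

# True where the value falls outside the assumed range
implausible = ~years.between(MIN_YEAR, MAX_YEAR)
print(implausible.sum())

# One option: treat implausible years as missing so a later
# imputation step handles them together with the real gaps
cleaned_years = years.mask(implausible)
print(cleaned_years)
```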


In [3]:
# Identify missing values in the dataset
missing_values = traffic_violations_df.isnull().sum()
missing_values


Date Of Stop                   0
Time Of Stop                   0
Agency                         0
SubAgency                      0
Description                    9
Location                       2
Latitude                   85035
Longitude                  85035
Accident                       0
Belts                          0
Personal Injury                0
Property Damage                0
Fatal                          0
Commercial License             0
HAZMAT                         0
Commercial Vehicle             0
Alcohol                        0
Work Zone                      0
State                         59
VehicleType                    0
Year                        6426
Make                          48
Model                        169
Color                      13591
Violation Type                 0
Charge                         0
Article                    52065
Contributed To Accident        0
Race                           0
Gender                         0
Driver Cit

**1.3 Handle Missing Values:**

In [4]:
from sklearn.impute import SimpleImputer

# Impute numerical columns with the median and categorical columns with the mode
numerical_imputer = SimpleImputer(strategy='median')
traffic_violations_df[['Latitude', 'Longitude', 'Year']] = numerical_imputer.fit_transform(traffic_violations_df[['Latitude', 'Longitude', 'Year']])

mode_imputer = SimpleImputer(strategy='most_frequent')
categorical_columns = ['Description', 'Location', 'State', 'Make', 'Model', 'Color', 'Driver City', 'Driver State', 'DL State', 'Article']
traffic_violations_df[categorical_columns] = mode_imputer.fit_transform(traffic_violations_df[categorical_columns])

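# Mark missing 'Geolocation' as 'Unknown'
# (assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas)
traffic_violations_df['Geolocation'] = traffic_violations_df['Geolocation'].fillna('Unknown')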

# Verify the imputation results
traffic_violations_df.isnull().sum()


Date Of Stop               0
Time Of Stop               0
Agency                     0
SubAgency                  0
Description                0
Location                   0
Latitude                   0
Longitude                  0
Accident                   0
Belts                      0
Personal Injury            0
Property Damage            0
Fatal                      0
Commercial License         0
HAZMAT                     0
Commercial Vehicle         0
Alcohol                    0
Work Zone                  0
State                      0
VehicleType                0
Year                       0
Make                       0
Model                      0
Color                      0
Violation Type             0
Charge                     0
Article                    0
Contributed To Accident    0
Race                       0
Gender                     0
Driver City                0
Driver State               0
DL State                   0
Arrest Type                0
Geolocation   

**The cells above identify missing values with df.isnull().sum() and then impute them: the median for the numerical columns (Latitude, Longitude, Year), the mode for the categorical columns, and the placeholder 'Unknown' for missing Geolocation values.**
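
**The same three strategies can be sketched with pandas alone. This is an illustrative equivalent on a toy frame, not the notebook's SimpleImputer code; the toy values are made up.**

```python
import pandas as pd

# Toy frame with one gap per column (illustrative values)
toy = pd.DataFrame({
    "Latitude": [39.0, None, 39.2],
    "Color": ["BLACK", "BLACK", None],
    "Geolocation": ["(39.0, -77.0)", None, "(39.2, -77.1)"],
})

# Numerical column: fill with the median of the observed values
toy["Latitude"] = toy["Latitude"].fillna(toy["Latitude"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["Color"] = toy["Color"].fillna(toy["Color"].mode().iloc[0])

# Free-text column: mark gaps with an explicit placeholder
toy["Geolocation"] = toy["Geolocation"].fillna("Unknown")

print(toy)
```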

**1.4 Remove Duplicate Rows:**

In [5]:
# Check for duplicate rows and remove them
traffic_violations_df.drop_duplicates(inplace=True)

# Verify the removal by checking the shape of the DataFrame
traffic_violations_df.shape

(1017322, 35)

**This code removes duplicate rows with the drop_duplicates() function to maintain data integrity. The row count falls from 1,018,634 to 1,017,322, so 1,312 duplicate rows were dropped.**
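
**A minimal sketch of counting duplicates before dropping them, so the number removed is reported explicitly. The toy frame and the variable names are illustrative assumptions.**

```python
import pandas as pd

# Toy frame in which the second row repeats the first
toy = pd.DataFrame({
    "Charge": ["21-707(a)", "21-707(a)", "13-401(h)"],
    "Gender": ["F", "F", "M"],
})

n_before = len(toy)

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_before, n_duplicates, len(deduped))
```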

**1.5 Data Type Correction:**

In [6]:
# Convert date/time columns to datetime format
traffic_violations_df['Date Of Stop'] = pd.to_datetime(traffic_violations_df['Date Of Stop'], format='%m/%d/%Y')
traffic_violations_df['Time Of Stop'] = pd.to_datetime(traffic_violations_df['Time Of Stop'], format='%H:%M:%S').dt.time

# Convert categorical columns to category type
categorical_columns = ['Agency', 'SubAgency', 'Description', 'Location', 'Accident', 'Belts', 'Personal Injury', 
                       'Property Damage', 'Fatal', 'Commercial License', 'HAZMAT', 'Commercial Vehicle', 'Alcohol', 
                       'Work Zone', 'State', 'VehicleType', 'Make', 'Model', 'Color', 'Violation Type', 'Charge', 
                       'Article', 'Contributed To Accident', 'Race', 'Gender', 'Driver City', 'Driver State', 
                       'DL State', 'Arrest Type', 'Geolocation']

for col in categorical_columns:
    traffic_violations_df[col] = traffic_violations_df[col].astype('category')

# Verify the data type changes
traffic_violations_df.dtypes

Date Of Stop               datetime64[ns]
Time Of Stop                       object
Agency                           category
SubAgency                        category
Description                      category
Location                         category
Latitude                          float64
Longitude                         float64
Accident                         category
Belts                            category
Personal Injury                  category
Property Damage                  category
Fatal                            category
Commercial License               category
HAZMAT                           category
Commercial Vehicle               category
Alcohol                          category
Work Zone                        category
State                            category
VehicleType                      category
Year                              float64
Make                             category
Model                            category
Color                            c

**The above code cell corrects the column data types: 'Date Of Stop' becomes datetime64, 'Time Of Stop' is parsed into Python time objects (so its dtype remains object), and the categorical columns are converted to the category type, ensuring consistency and compatibility for analysis.**
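
**If a single timestamp is more convenient than separate date and time columns, the two can be combined. A minimal sketch, assuming the MM/DD/YYYY and HH:MM:SS formats seen in the sample rows; the 'Stop Datetime' and 'Stop Hour' column names are hypothetical.**

```python
import pandas as pd

# Toy rows in the same string formats as the sample output
toy = pd.DataFrame({
    "Date Of Stop": ["09/24/2013", "12/20/2012"],
    "Time Of Stop": ["17:11:00", "00:41:00"],
})

# Join date and time text, then parse once with an explicit format;
# errors="coerce" turns unparseable entries into NaT instead of raising
toy["Stop Datetime"] = pd.to_datetime(
    toy["Date Of Stop"] + " " + toy["Time Of Stop"],
    format="%m/%d/%Y %H:%M:%S",
    errors="coerce",
)

# A true datetime column exposes the .dt accessor for derived features
toy["Stop Hour"] = toy["Stop Datetime"].dt.hour

print(toy.dtypes)
```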

## 2. Data Transformation and Feature Engineering:
**2.1 Encode Categorical Variables:**

In [7]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import numpy as np

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply Label Encoding for high-cardinality columns
high_cardinality_columns = ['Description', 'Location', 'Make', 'Model', 'Charge', 'Article', 'Driver City']
for col in high_cardinality_columns:
    traffic_violations_df[col] = label_encoder.fit_transform(traffic_violations_df[col])

# Select columns for OneHotEncoding (low cardinality)
low_cardinality_columns = ['Agency', 'SubAgency', 'Accident', 'Belts', 'Personal Injury', 
                           'Property Damage', 'Fatal', 'Commercial License', 'HAZMAT', 'Commercial Vehicle', 
                           'Alcohol', 'Work Zone', 'State', 'VehicleType', 'Color', 
                           'Violation Type', 'Contributed To Accident', 'Race', 'Gender', 
                           'Driver State', 'DL State', 'Arrest Type']

# Initialize OneHotEncoder with sparse output
# (scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=True)
one_hot_encoder = OneHotEncoder(sparse_output=True, drop='first')

# Apply OneHotEncoding to low-cardinality columns and keep the output as a sparse matrix
encoded_columns = one_hot_encoder.fit_transform(traffic_violations_df[low_cardinality_columns])

# The shape of the encoded columns can be inspected to verify the transformation
print(encoded_columns.shape)


(1017322, 303)


**Here, categorical variables are encoded using LabelEncoder for high-cardinality columns and OneHotEncoder for low-cardinality columns to prepare them for machine learning models.**
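
**The two encodings can also be sketched with pandas alone. This is an illustrative equivalent on a toy frame, not the notebook's scikit-learn code; the 'Make_code' column name is hypothetical.**

```python
import pandas as pd

# Toy frame with one higher-cardinality and one low-cardinality column
toy = pd.DataFrame({
    "Make": ["TOYOTA", "HONDA", "TOYOTA", "FORD"],
    "Gender": ["M", "F", "F", "M"],
})

# Label-style encoding: each distinct value gets an integer code,
# assigned in sorted category order (FORD, HONDA, TOYOTA)
toy["Make_code"] = toy["Make"].astype("category").cat.codes

# One-hot encoding: one indicator column per category; drop_first=True
# omits the first category to avoid a redundant column
one_hot = pd.get_dummies(toy[["Gender"]], drop_first=True)

print(toy)
print(one_hot)
```

**Integer codes imply an ordering that the categories do not have, so they suit tree-based models better than linear ones.**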

In [8]:
traffic_violations_df.columns

Index(['Date Of Stop', 'Time Of Stop', 'Agency', 'SubAgency', 'Description',
       'Location', 'Latitude', 'Longitude', 'Accident', 'Belts',
       'Personal Injury', 'Property Damage', 'Fatal', 'Commercial License',
       'HAZMAT', 'Commercial Vehicle', 'Alcohol', 'Work Zone', 'State',
       'VehicleType', 'Year', 'Make', 'Model', 'Color', 'Violation Type',
       'Charge', 'Article', 'Contributed To Accident', 'Race', 'Gender',
       'Driver City', 'Driver State', 'DL State', 'Arrest Type',
       'Geolocation'],
      dtype='object')

**2.2 Normalize/Standardize Numerical Variables:**

