# Traffic congestion

Traffic congestion is a major challenge in urban environments, and addressing it improves the quality of life of inhabitants.
Congestion is measured using parameters such as delay, speed, road-segment length, and the number of vehicles. Road space occupancy is another measure of congestion in a given area. The road occupancy rate is determined from the number of vehicles, the length of the road segment, the length of the vehicles, and the average gap between vehicles on the road. In this study, we aim to predict the road occupancy rate from the counts of the vehicle types that influence congestion, together with the time of day, taking the key contributing features into account.
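
As an illustration of how these factors can combine, the sketch below computes an occupancy rate as the share of a segment's length taken up by vehicles and the gaps between them. The formula, the function name `road_occupancy`, and the example lengths are assumptions made for illustration; the data source does not state the exact formula behind the `roadOccupancy` field.

```python
def road_occupancy(n_vehicles, segment_length_m, vehicle_length_m, gap_m):
    """Fraction of a road segment taken up by vehicles and the gaps between them.

    Assumed formula (not confirmed by the data source):
        occupancy = n_vehicles * (vehicle_length + gap) / segment_length
    """
    if segment_length_m <= 0:
        raise ValueError("segment_length_m must be positive")
    return n_vehicles * (vehicle_length_m + gap_m) / segment_length_m


# Hypothetical numbers: 10 vehicles of 4 m with 1 m gaps on a 100 m segment.
print(road_occupancy(10, 100.0, 4.0, 1.0))  # 0.5
```

Under this form the rate can exceed 1 when more vehicles are counted than would fit on the segment, which is consistent with the maximum of 1.5133 in the summary statistics of this dataset. The sample rows shown in this notebook (for example, 43 vehicles with an occupancy of 0.1433) are consistent with roughly vehicleCount / 300, which this formula reproduces whenever the segment is 300 times as long as the per-vehicle footprint.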

In [16]:
##Data Wrangling 

In [2]:
#Prerequisite python packages 
import pandas as pd
import ast
import json

The JSON packets are validated, restructured into a well-formed array, and saved as a CSV file. Unwanted attributes and duplicates are then removed, and other corrections are applied, before further analysis.

In [3]:

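filename = "traffic_density.json"

# Read the raw file, which may contain JSON objects written back to back.
with open(filename, 'r') as file:
    json_data = file.read().strip()

# Insert the missing commas between adjacent objects.
corrected_json_data = json_data.replace('}{', '},{')

# Wrap the objects in brackets so the text parses as one JSON array.
if not (corrected_json_data.startswith('[') and corrected_json_data.endswith(']')):
    corrected_json_data = '[' + corrected_json_data + ']'

json_list = json.loads(corrected_json_data)

df = pd.DataFrame(json_list)

# Note: the index is written as an unnamed first column, which is dropped later.
df.to_csv("traffic_density.csv")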
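
The string replacement above only works when objects are written back to back as `}{`. It misses objects separated by whitespace or newlines, and it would also alter a `}{` that appears inside a string value. A minimal alternative sketch, assuming the file is a plain concatenation of JSON objects, uses the standard library's `json.JSONDecoder.raw_decode` to read one object at a time. The helper name `parse_concatenated_json` is hypothetical.

```python
import json


def parse_concatenated_json(text):
    """Parse JSON values that are concatenated with optional whitespace between them."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


print(parse_concatenated_json('{"a": 1}{"a": 2}\n{"a": 3}'))
# [{'a': 1}, {'a': 2}, {'a': 3}]
```

The resulting list can be passed to `pd.DataFrame` in the same way as `json_list`.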


The attributes count_car, count_pedestrian, count_cycle, count_truck, count_threewheeler, and count_motorbike, which are nested within the vehicleTypeCount dictionary, are extracted as separate columns in the DataFrame. Subsequently, the vehicleTypeCount dictionary is removed from the DataFrame.

In [19]:
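filename = "prayagraj_traffic_density.csv"
df = pd.read_csv(filename)

# Parse the string form of the dictionary back into a Python dict.
def convert_dict_column(dict_str):
    return ast.literal_eval(dict_str.replace("'", '"'))

# Apply the function to the vehicleTypeCount column
df['vehicleTypeCount'] = df['vehicleTypeCount'].apply(convert_dict_column)

# Create a new DataFrame by expanding the dictionary into columns
vehicle_type_df = df['vehicleTypeCount'].apply(pd.Series)

# Rename the columns to include the prefix 'count_'
vehicle_type_df = vehicle_type_df.rename(columns=lambda x: 'count_' + x)

# Drop the original vehicleTypeCount column from the original DataFrame
df = df.drop(columns=['vehicleTypeCount'])

# Concatenate the original DataFrame with the expanded columns
df = pd.concat([df, vehicle_type_df], axis=1)

df.to_csv('prayagraj_traffic_density.csv')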
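
On a frame of this size (932,750 rows according to the `df.info()` output), `apply(pd.Series)` builds one Series per row and can be slow. The sketch below shows an equivalent expansion on a small made-up frame; the sample rows and the names `sample`, `parsed`, `counts`, and `expanded` are illustrative and not taken from the dataset. It builds the new columns in one call from the list of parsed dictionaries.

```python
import ast

import pandas as pd

# Made-up rows that mimic the structure of the real CSV.
sample = pd.DataFrame({
    "roadID": ["L01", "L02"],
    "vehicleTypeCount": ["{'car': 3, 'truck': 1}", "{'car': 0, 'truck': 2}"],
})

# Parse each string into a dict; literal_eval accepts single-quoted keys as is.
parsed = sample["vehicleTypeCount"].apply(ast.literal_eval)

# Build all count columns at once and prefix them with 'count_'.
counts = pd.DataFrame(parsed.tolist(), index=sample.index).add_prefix("count_")

# Swap the dictionary column for the expanded columns.
expanded = pd.concat([sample.drop(columns=["vehicleTypeCount"]), counts], axis=1)
print(expanded.columns.tolist())  # ['roadID', 'count_car', 'count_truck']
```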


The dataset is imported into a DataFrame, and the data type of zoneID is converted to string.


In [4]:
filename = "/home/iudx/pari/python/model/pryagraj_traffic_density/prayagraj_traffic_density.csv"
df = pd.read_csv(filename)

In [9]:
df.head()

| | Unnamed: 0.1 | Unnamed: 0 | vehicleTypeCount | roadID | id | vehicleCount | roadOccupancy | junctionID | observationDateTime | zoneID | junctionMode | count_car | count_pedestrian | count_cycle | count_truck | count_threewheeler | count_motorbike |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | {'car': 0, 'pedestrian': 0, 'cycle': 0, 'truck... | L01 | a4e1d614-ee93-4e37-9e40-629a3b9e4299 | 2.0 | 0.0067 | J035 | 2024-05-10T00:25:00+05:30 | 1 | FLSH | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | {'car': 0, 'pedestrian': 0, 'cycle': 0, 'truck... | L02 | a4e1d614-ee93-4e37-9e40-629a3b9e4299 | 5.0 | 0.0167 | J035 | 2024-05-10T00:25:00+05:30 | 1 | FLSH | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 2 | {'car': 0, 'pedestrian': 0, 'cycle': 0, 'truck... | L02 | a4e1d614-ee93-4e37-9e40-629a3b9e4299 | 0.0 | 0.0 | J035 | 2024-05-10T00:25:00+05:30 | 2 | FLSH | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 3 | {'car': 0, 'pedestrian': 0, 'cycle': 0, 'truck... | L03 | a4e1d614-ee93-4e37-9e40-629a3b9e4299 | 22.0 | 0.0733 | J035 | 2024-05-10T00:25:00+05:30 | 1 | FLSH | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 4 | {'car': 0, 'pedestrian': 0, 'cycle': 0, 'truck... | L03 | a4e1d614-ee93-4e37-9e40-629a3b9e4299 | 43.0 | 0.1433 | J035 | 2024-05-10T00:25:00+05:30 | 2 | FLSH | 0 | 0 | 0 | 0 | 0 | 0 |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 932750 entries, 0 to 932749
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Unnamed: 0.1         932750 non-null  int64  
 1   Unnamed: 0           932750 non-null  int64  
 2   vehicleTypeCount     932750 non-null  object 
 3   roadID               932750 non-null  object 
 4   id                   932750 non-null  object 
 5   vehicleCount         932750 non-null  float64
 6   roadOccupancy        932750 non-null  float64
 7   junctionID           932750 non-null  object 
 8   observationDateTime  932750 non-null  object 
 9   zoneID               932750 non-null  int64  
 10  junctionMode         932750 non-null  object 
 11  count_car            932750 non-null  int64  
 12  count_pedestrian     932750 non-null  int64  
 13  count_cycle          932750 non-null  int64  
 14  count_truck          932750 non-null  int64  
 15  count_threewheele

In [37]:
df.zoneID = df.zoneID.astype(str)

In [25]:
df.describe()

| | Unnamed: 0.1 | Unnamed: 0 | vehicleCount | roadOccupancy | count_car | count_pedestrian | count_cycle | count_truck | count_threewheeler | count_motorbike |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 932750.0 | 932750.0 | 932750.0 | 932750.0 | 932750.0 | 932750.0 | 932750.0 | 932750.0 | 932750.0 | 932750.0 |
| mean | 466374.5 | 466374.5 | 27.055381 | 0.090184 | 8.482146 | 0.0 | 0.0 | 2.153927 | 0.0 | 10.198421 |
| std | 269261.876132 | 269261.876132 | 39.796204 | 0.132654 | 16.297627 | 0.0 | 0.0 | 5.358592 | 0.0 | 20.973155 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 233187.25 | 233187.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 466374.5 | 466374.5 | 8.0 | 0.0267 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 699561.75 | 699561.75 | 40.0 | 0.1333 | 10.0 | 0.0 | 0.0 | 2.0 | 0.0 | 10.0 |
| max | 932749.0 | 932749.0 | 454.0 | 1.5133 | 176.0 | 0.0 | 0.0 | 120.0 | 0.0 | 301.0 |


In [7]:
df.shape

(932750, 17)

In [8]:
df.isna().sum()

Unnamed: 0.1           0
Unnamed: 0             0
vehicleTypeCount       0
roadID                 0
id                     0
vehicleCount           0
roadOccupancy          0
junctionID             0
observationDateTime    0
zoneID                 0
junctionMode           0
count_car              0
count_pedestrian       0
count_cycle            0
count_truck            0
count_threewheeler     0
count_motorbike        0
dtype: int64

In [14]:
# Identify numeric attributes that contain only zeros; they carry no information for the analysis.
int_float_columns = df.select_dtypes(include=['int64', 'float64'])

columns_with_only_zero = [col for col in int_float_columns.columns if (df[col] == 0).all()]
columns_with_only_zero

['count_pedestrian', 'count_cycle', 'count_threewheeler']

Unnecessary features are filtered out of the dataset. count_threewheeler, count_pedestrian, and count_cycle contain only zeros because the smart city does not supply data for them, so they are removed as they add no value to the model. The leftover index columns, the raw vehicleTypeCount dictionary, and id are dropped in the same step.

In [5]:
df = df.drop(columns = ['vehicleTypeCount','Unnamed: 0.1','Unnamed: 0','id','count_threewheeler','count_pedestrian','count_cycle'])
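
The all-zero columns found earlier can also be dropped from the computed list instead of being typed by hand, which keeps the step correct if the set of empty columns changes. A minimal sketch on made-up data (the frame `toy` and its values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "count_car": [0, 4, 7],
    "count_cycle": [0, 0, 0],
    "roadOccupancy": [0.0, 0.01, 0.02],
    "roadID": ["L01", "L02", "L03"],
})

# Numeric columns whose every value is zero.
numeric = toy.select_dtypes(include="number")
zero_only = numeric.columns[(numeric == 0).all()].tolist()

toy = toy.drop(columns=zero_only)
print(zero_only)  # ['count_cycle']
```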

Check the number of duplicate rows in the training dataset.

In [40]:
df.duplicated().sum() 

165

In [6]:
#Removing the duplicates
df = df.drop_duplicates()
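
Before dropping, it can help to look at the repeated rows. In the small sketch below, built on made-up rows, `keep=False` marks every member of a duplicate group, whereas the default `keep='first'` used by `duplicated().sum()` counts only the extra copies.

```python
import pandas as pd

toy = pd.DataFrame({
    "roadID": ["L01", "L01", "L02", "L01"],
    "vehicleCount": [2.0, 2.0, 5.0, 2.0],
})

print(toy.duplicated().sum())            # 2 extra copies
print(toy.duplicated(keep=False).sum())  # 3 rows belong to a duplicate group

# Keep the first copy of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2
```

Resetting the index after `drop_duplicates` is optional; it only removes the gaps left in the row labels.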

In [7]:
# Converting the observationDateTime from string to datetime

df['observationDateTime'] = pd.to_datetime(df['observationDateTime'])


The hour, day of the week, and month are extracted from the datetime column for the analysis.

In [8]:
df['hour'] = df['observationDateTime'].dt.hour
df['weekday'] = df['observationDateTime'].dt.dayofweek  
df['month'] = df['observationDateTime'].dt.month
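
For reference, `dt.dayofweek` numbers the days from Monday = 0 to Sunday = 6, and the components are taken in the timestamp's own offset (+05:30 here), not in UTC. A quick sketch on two timestamps in the dataset's format (the second one is made up):

```python
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2024-05-10T00:25:00+05:30",  # a Friday
    "2024-05-12T18:40:00+05:30",  # a Sunday
]))

print(ts.dt.hour.tolist())       # [0, 18]
print(ts.dt.dayofweek.tolist())  # [4, 6]
print(ts.dt.month.tolist())      # [5, 5]
```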

In [9]:
# Aggregate per road and junction by hour of day, weekday, and month
df = df.groupby(['roadID', 'junctionID', 'hour', 'weekday', 'month']).agg({
    'vehicleCount': 'sum',
    'roadOccupancy': 'mean',
    'count_car': 'sum',
    'count_truck': 'sum',
    'count_motorbike': 'sum'
}).reset_index()
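
Because the grouping keys are hour, weekday, and month rather than the calendar date, all observations that share a road, junction, hour of day, weekday, and month are merged into one row: counts are summed and occupancy is averaged. The sketch below shows this on made-up rows, written with pandas named aggregation, which is an equivalent spelling of the dictionary form above. The names `toy` and `hourly` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "roadID": ["L01", "L01", "L02"],
    "junctionID": ["J035", "J035", "J035"],
    "hour": [0, 0, 0],
    "weekday": [4, 4, 4],
    "month": [5, 5, 5],
    "vehicleCount": [2.0, 4.0, 5.0],
    "roadOccupancy": [0.0067, 0.0133, 0.0167],
})

# The two L01 rows share every key, so they collapse into a single row.
hourly = toy.groupby(
    ["roadID", "junctionID", "hour", "weekday", "month"], as_index=False
).agg(
    vehicleCount=("vehicleCount", "sum"),
    roadOccupancy=("roadOccupancy", "mean"),
)
print(hourly["vehicleCount"].tolist())  # [6.0, 5.0]
```

Note that the summed counts depend on how many observations fall into each group, so groups with more readings get larger totals.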


In [None]:
df.head()