In [None]:
import warnings
import pandas as pd

# settings
pd.set_option('mode.chained_assignment', None)
warnings.filterwarnings("ignore", category=DeprecationWarning)

# imbalance
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# models
from statsmodels.miscmodels.ordinal_model import OrderedModel

In [2]:
df = pd.read_csv("C:/Users/Настя/YandexDisk-n4skolesnikova/HSE 4th year/Graduation Thesis/data/ACCIDENT_LEVEL_DATA.csv")
df = df.drop('Unnamed: 0', axis=1)
print(df.shape)
df.head()

(440127, 75)


| | REGION | DATE | COORD_L | COORD_W | road_name | road_category | n_VEHICLES | n_PARTICIPANTS | ID | n_DEATHS | ... | severity | YEAR | MONTH | WEEKDAY | SEASON | is_WEEKEND | HOUR | is_NIGHT | is_PEAK_HOUR | is_toll |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 31.01.2015 | 81.151944 | 53.74 | Романово - Завьялово - Баево - Камень-на-Оби | 5.0 | 1 | 3 | 161242174 | 0 | ... | 2 | 2015 | 1 | 5 | 1 | 1 | 9 | 0 | 1 | 0 |
| 1 | 1 | 30.01.2015 | 85.018056 | 51.684444 | Куяган - Куяча - Тоурак | 6.0 | 2 | 3 | 161105683 | 0 | ... | 2 | 2015 | 1 | 4 | 1 | 0 | 14 | 0 | 0 | 0 |
| 2 | 1 | 30.01.2015 | 81.25 | 53.818056 | Барнаул - Камень-на-Оби - граница Новосибирско... | 5.0 | 2 | 3 | 161763431 | 0 | ... | 1 | 2015 | 1 | 4 | 1 | 0 | 17 | 0 | 1 | 0 |
| 3 | 1 | 24.01.2015 | 51.0 | 84.0 | Быканов Мост - Солоновка - Солонешное - границ... | 7.0 | 1 | 2 | 160331994 | 0 | ... | 1 | 2015 | 1 | 5 | 1 | 1 | 19 | 0 | 0 | 0 |
| 4 | 1 | 23.01.2015 | 84.0 | 53.0 | Быканов Мост - Солоновка - Солонешное - границ... | 7.0 | 1 | 2 | 160213415 | 1 | ... | 3 | 2015 | 1 | 4 | 1 | 0 | 21 | 0 | 0 | 0 |


In [3]:
df.columns

Index(['REGION', 'DATE', 'COORD_L', 'COORD_W', 'road_name', 'road_category',
       'n_VEHICLES', 'n_PARTICIPANTS', 'ID', 'n_DEATHS', 'n_INJURED',
       'vehicle_failure', 'non_private_vehicle', 'russian_vehicle',
       'white_vehicle', 'black_vehicle', 'colored_vehicle', 'drunk_driver',
       'female_driver', 'escaped', 'no_seatbelt_injury', 'n_drunk',
       'n_children', 'n_cyclists', 'n_pedestrians', 'vehicle_age_min',
       'vehicle_age_max', 'vehicle_age_avg', 'n_class_a', 'n_class_b',
       'n_class_c', 'n_class_d', 'n_class_e', 'n_class_s', 'n_front_drive',
       'n_rear_drive', 'n_4wd', 'n_guilty', 'guilty_share',
       'n_fatal_violations', 'guilty_exp_avg', 'exp_avg', 'road_rank_cat',
       'road_defects_cat', 'traffic_changes_cat', 'road_surface_cat',
       'TYPE_cat', 'street_rank_cat', 'weather_cat', 'adj_objects_cat',
       'cause_factors_cat', 'crossing_violation', 'impaired_driving',
       'interference_violation', 'license_violation', 'maneuver_violation',


In [None]:
# These features were created for the panel dataset, not for the accident-level data
features_to_drop = ['no_lighting', 'adj_objects_interpretable', 'weather_interpretable', 'traffic_changes_bin']
df.drop(features_to_drop, axis=1, inplace=True)

In [4]:
print(f"Final check for the gaps: {df.isna().any().sum()} gaps")

Final check for the gaps: 0 gaps


### Hypothesis 2.1: 
#### «The probability of a more severe outcome of traffic accidents is higher on toll roads than on alternative free roads»


### Hypothesis 2.2: 
#### «The factors influencing the severity of traffic accidents differ for toll and free roads»

----

In [5]:
df['severity'].unique()

array([2, 1, 3], dtype=int64)

A quick reminder of how the `severity` feature is encoded before testing the hypotheses: 1 = light injuries, 2 = medium injuries, 3 = severe injuries.

In [6]:
df_hypoth = df.copy()

# Econometrics

## Analysis of features

In [None]:
features_to_drop = [
    'vehicle_age_min',
    'vehicle_age_max',
    'guilty_exp_avg',
    'n_VEHICLES',
    'WEEKDAY',
    'n_drunk',
    'n_PARTICIPANTS',
    'vehicle_failure',
    'drunk_driver',
    'MONTH'
]

df_hypoth.drop(features_to_drop, axis=1, inplace=True)

## Ordered Logit regression

The hypotheses are tested with the model specifications described below. Each one is estimated as an ordered logit (`OrderedModel` with `distr='logit'`), since `severity` is an ordered outcome.

**1. Base Model (`severity ~ is_toll`)**  
The initial specification includes only the `is_toll` variable to assess its direct relationship with accident severity without the influence of other factors. This simple model provides insight into the "raw" effect of toll roads and tests whether there is a statistically significant difference between accidents on toll and free roads. Although the results of this model may be biased due to unaccounted variables, it serves as an important starting point for hypothesis testing.

**2. Geographic and Road Specifications (`+ road_category + road_rank_cat + site_objects_cat + adj_objects_cat + street_rank_cat`)**  
In the second model, we add road type and category characteristics, as well as objects located along the roads. These features are necessary to rule out situations where accident severity is explained not by the toll status of the road, but, for example, by a higher class of the road or the specifics of the road infrastructure. By controlling for these factors, we check whether the `is_toll` effect remains robust when considering the possible structural differences between toll and free roads.

**3. Incident Condition Model (`+ weather_cat + lighting_cat + road_surface_cat + ...`)**  
The third model controls for external conditions at the time of the accident: weather factors, lighting, road surface condition, seasonality, and time characteristics. These variables are directly related to the probability of severe outcomes and may be unevenly distributed between toll and free roads. By adding them to the model, we minimize the risk of misinterpreting the `is_toll` effect, for example, due to frequent night-time accidents or precipitation on certain types of roads.

**4. Behavioral Model (`+ license_violation + impaired_driving + ...`)**  
The fourth specification focuses on the behavior of the accident participants. Variables such as traffic violations, impaired driving, and the number of at-fault participants (`n_guilty`) reflect the immediate causes of accident severity. If the `is_toll` variable remains significant even after including these features, its association with severity is not mediated solely by driver behavior.

**5. Full Model (all control variables)**  
The fifth model combines all the previous blocks of features into one specification, which gives the most comprehensive test of the hypothesis. It shows whether the toll road effect persists after simultaneously controlling for geographic, behavioral, infrastructure, and weather conditions. If the `is_toll` coefficient remains statistically significant and positive, this supports the original hypothesis that accidents on toll roads tend to be more severe.
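
The specifications above can be kept in one place as named feature blocks, so the long column lists do not have to be retyped for every fit. The sketch below is only an illustration: `BLOCKS`, `build_features`, and `fit_spec` are hypothetical names, and the column lists are copied from the cells for specifications 2 to 4. The full-model cell is not reproduced here because its column list differs slightly from the union of these blocks.

```python
# Feature blocks copied from the cells for specifications 2-4 of this notebook.
# BLOCKS, build_features and fit_spec are illustrative names, not original code.
BLOCKS = {
    "road": ["road_category", "road_rank_cat", "site_objects_cat",
             "adj_objects_cat", "street_rank_cat"],
    "conditions": ["weather_cat", "lighting_cat", "road_surface_cat",
                   "traffic_changes_cat", "is_NIGHT", "is_PEAK_HOUR",
                   "is_WEEKEND", "SEASON", "cause_factors_cat"],
    "behavior": ["wrong_way", "pedestrian_violation", "impaired_driving",
                 "maneuver_violation", "traffic_control_violation",
                 "license_violation", "transport_violation",
                 "crossing_violation", "interference_violation",
                 "sudden_appearance_violation", "other_violation", "n_guilty"],
}


def build_features(*block_names):
    """Return ['is_toll'] followed by the requested blocks, without duplicates."""
    features = ["is_toll"]
    for name in block_names:
        for col in BLOCKS[name]:
            if col not in features:
                features.append(col)
    return features


def fit_spec(df, features):
    """Fit one ordered logit on df; statsmodels is imported lazily."""
    from statsmodels.miscmodels.ordinal_model import OrderedModel
    model = OrderedModel(df["severity"], df[features], distr="logit")
    return model.fit(method="bfgs", disp=False)


# Column counts per specification (is_toll included).
for spec in ([], ["road"], ["conditions"], ["behavior"]):
    print(spec or ["base"], len(build_features(*spec)))
```

Calling `fit_spec(df_hypoth, build_features("road"))` would then correspond to the second specification, assuming `severity` has already been converted to an ordered categorical.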

#### Class `'is_toll'` imbalance

Earlier, a class imbalance was found in the `'is_toll'` variable: the positive class makes up less than 7% of the entire dataset. Estimates obtained on such data may therefore be dominated by the majority class.  
To check how sensitive the results are to this imbalance, we compare three approaches:

0. Without balancing
1. Balancing using undersampling  
2. Balancing using class weighting (approximated below by random oversampling of the minority class)
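
For reference, class weighting usually means giving each observation a weight inversely proportional to the frequency of its class. The sketch below works through that arithmetic with the toll and free counts reported by the separate-model outputs in this notebook (29,540 and 410,587). The `balanced_weights` helper is a hypothetical name, and the n / (k · n_c) rule is the common "balanced" heuristic, assumed here rather than taken from the original analysis.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Group sizes as reported by the toll-road and free-road model summaries.
n_toll, n_free = 29_540, 410_587
labels = [1] * n_toll + [0] * n_free

weights = balanced_weights(labels)
toll_share = n_toll / len(labels)

print(f"share of toll-road accidents: {toll_share:.2%}")
print({cls: round(w, 3) for cls, w in weights.items()})
```

With these weights both groups carry the same total weight, which is what random oversampling achieves by duplicating minority rows instead.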

In [None]:
df_hypoth['severity'] = pd.Categorical(df_hypoth['severity'], ordered=True)
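
As a reminder of what the ordered logit estimates: the model assumes a linear index x'β and increasing cut points κ1 < κ2, with P(severity ≤ j) = Λ(κj − x'β), where Λ is the logistic CDF. The sketch below turns an index value and two cut points into the three category probabilities. The function names and all numeric values are made up for illustration and are not estimates from this dataset.

```python
import math


def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(xb, cutpoints):
    """Category probabilities for linear index xb and increasing cut points."""
    cdf = [logistic_cdf(k - xb) for k in cutpoints] + [1.0]
    probs, prev = [], 0.0
    for c in cdf:
        probs.append(c - prev)  # P(y = j) = P(y <= j) - P(y <= j-1)
        prev = c
    return probs


cut = [0.5, 2.0]                         # hypothetical cut points
p_low = ordered_logit_probs(0.0, cut)    # index without a hypothetical toll effect
p_high = ordered_logit_probs(0.3, cut)   # index shifted by a hypothetical +0.3

print([round(p, 3) for p in p_low])
print([round(p, 3) for p in p_high])
```

A positive coefficient raises the index and therefore moves probability mass from severity 1 toward severity 3, which is how the sign of `is_toll` is read in the summaries.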

### (0). Without balancing

In [9]:
X = df_hypoth[['is_toll']]
y = df_hypoth['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print("\n1. Ordered Logit model:")
print(result.summary())

Optimization terminated successfully.
         Current function value: 1.028197
         Iterations: 7
         Function evaluations: 10
         Gradient evaluations: 10

1. Ordered Logit model:
                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:            -4.5254e+05
Model:                   OrderedModel   AIC:                         9.051e+05
Method:            Maximum Likelihood   BIC:                         9.051e+05
Date:                Sat, 03 May 2025                                         
Time:                        04:48:56                                         
No. Observations:              440127                                         
Df Residuals:                  440124                                         
Df Model:                           1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
--------------

In [12]:
X = df_hypoth[['is_toll', 'road_category', 'road_rank_cat', 'site_objects_cat', 'adj_objects_cat', 'street_rank_cat']]
y = df_hypoth['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\n2. Geographical and road specification:\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 1.023711
         Iterations: 24
         Function evaluations: 27
         Gradient evaluations: 27

2. Geographical and road specification:

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:            -4.5056e+05
Model:                   OrderedModel   AIC:                         9.011e+05
Method:            Maximum Likelihood   BIC:                         9.012e+05
Date:                Sat, 03 May 2025                                         
Time:                        04:59:31                                         
No. Observations:              440127                                         
Df Residuals:                  440119                                         
Df Model:                           6                                         
                       coef    std err          z      P>|z|      [0.025   

In [13]:
X = df_hypoth[['is_toll', 'weather_cat', 'lighting_cat', 'road_surface_cat', 'traffic_changes_cat', 'is_NIGHT', 'is_PEAK_HOUR', 'is_WEEKEND', 'SEASON', 'cause_factors_cat']]
y = df_hypoth['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\n3. Model of incident conditions:\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 1.020381
         Iterations: 46
         Function evaluations: 49
         Gradient evaluations: 49

3. Model of incident conditions:

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:            -4.4910e+05
Model:                   OrderedModel   AIC:                         8.982e+05
Method:            Maximum Likelihood   BIC:                         8.984e+05
Date:                Sat, 03 May 2025                                         
Time:                        05:04:25                                         
No. Observations:              440127                                         
Df Residuals:                  440115                                         
Df Model:                          10                                         
                          coef    std err          z      P>|z|      [0.025      0

In [14]:
X = df_hypoth[['is_toll', 'wrong_way', 'pedestrian_violation', 'impaired_driving', 'maneuver_violation', 'traffic_control_violation', 'license_violation', 
               'transport_violation', 'crossing_violation', 'interference_violation', 'sudden_appearance_violation', 'other_violation', 'n_guilty']]
y = df_hypoth['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\n4. Behavioral model:\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.992985
         Iterations: 84
         Function evaluations: 85
         Gradient evaluations: 85

4. Behavioral model:

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:            -4.3704e+05
Model:                   OrderedModel   AIC:                         8.741e+05
Method:            Maximum Likelihood   BIC:                         8.743e+05
Date:                Sat, 03 May 2025                                         
Time:                        05:14:11                                         
No. Observations:              440127                                         
Df Residuals:                  440112                                         
Df Model:                          13                                         
                                  coef    std err          z      P>|z|      [0.025      0.975

In [15]:
X = df_hypoth[['is_toll', 'road_category', 'road_rank_cat', 'site_objects_cat', 'adj_objects_cat', 
               'weather_cat', 'lighting_cat', 'road_surface_cat', 'traffic_changes_cat', 'is_NIGHT', 'is_PEAK_HOUR', 'is_WEEKEND', 'SEASON', 
               'wrong_way', 'pedestrian_violation', 'impaired_driving', 'maneuver_violation', 'traffic_control_violation', 'license_violation', 
               'transport_violation', 'crossing_violation', 'interference_violation', 'sudden_appearance_violation', 'other_violation', 
               'TYPE_cat', 'exp_avg', 'guilty_share', 'n_fatal_violations', 'n_INJURED', 'cause_factors_cat', 'n_guilty']]
y = df_hypoth['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\n5. Full model (all control variables):\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.981128
         Iterations: 118
         Function evaluations: 121
         Gradient evaluations: 121

5. Full model (all control variables):

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:            -4.3182e+05
Model:                   OrderedModel   AIC:                         8.637e+05
Method:            Maximum Likelihood   BIC:                         8.641e+05
Date:                Sat, 03 May 2025                                         
Time:                        05:31:53                                         
No. Observations:              440127                                         
Df Residuals:                  440094                                         
Df Model:                          31                                         
                                  coef    std err          z      P>|z|  
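
The fit statistics printed in the summaries can be cross-checked by hand. The sketch below recomputes AIC from the reported log-likelihoods and forms a likelihood-ratio statistic between the base and full specifications, which are nested because the full model also contains `is_toll`. It assumes that the parameter count equals the number of slope coefficients plus two cut points, and it uses the log-likelihoods only to the five significant digits shown in the output.

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik


N_CUTPOINTS = 2  # three severity levels imply two thresholds

# Log-likelihoods and 'Df Model' values as printed in the summaries (rounded).
ll_base, k_base = -4.5254e5, 1 + N_CUTPOINTS
ll_full, k_full = -4.3182e5, 31 + N_CUTPOINTS

aic_base = aic(ll_base, k_base)
aic_full = aic(ll_full, k_full)

# Likelihood-ratio statistic for the 30 additional regressors.
lr_stat = 2 * (ll_full - ll_base)
lr_df = k_full - k_base

print(f"AIC base: {aic_base:.4g}, AIC full: {aic_full:.4g}")
print(f"LR statistic: {lr_stat:.0f} on {lr_df} degrees of freedom")
```

The recomputed AIC values agree with the printed 9.051e+05 and 8.637e+05. An LR statistic of this size on 30 degrees of freedom is far beyond conventional chi-square critical values, so the controls jointly improve the fit; with about 440,000 observations this is expected and says nothing by itself about the `is_toll` coefficient.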

### (1). Undersampling

In [18]:
X = df_hypoth.drop(columns=["is_toll"])
y = df_hypoth["is_toll"]

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

df_balanced = pd.concat([X_resampled, y_resampled], axis=1)
df_balanced.shape

(59080, 63)
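
Conceptually, random undersampling keeps every row of the smaller group and draws an equally sized random subset of the larger one, which is why the balanced frame has 2 × 29,540 = 59,080 rows. The sketch below reproduces that logic with the standard library on toy data. `undersample` is a hypothetical helper, not the imblearn implementation, and it would not select the same rows as `RandomUnderSampler`.

```python
import random
from collections import defaultdict


def undersample(rows, label_of, seed=42):
    """Randomly shrink every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    return balanced


# Toy data: 7 toll-road rows among 100 accidents (values are made up).
toy = [{"id": i, "is_toll": int(i < 7)} for i in range(100)]
balanced = undersample(toy, label_of=lambda r: r["is_toll"])

print(len(balanced), sum(r["is_toll"] for r in balanced))
```

Note that in this notebook the balanced variable is the regressor `is_toll`, so the procedure equalizes the sizes of the toll and free groups rather than the `severity` classes.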

In [21]:
X = df_balanced[['is_toll', 'road_category', 'road_rank_cat', 'site_objects_cat', 'adj_objects_cat', 
               'weather_cat', 'lighting_cat', 'road_surface_cat', 'traffic_changes_cat', 'is_NIGHT', 'is_PEAK_HOUR', 'is_WEEKEND', 'SEASON', 
               'wrong_way', 'pedestrian_violation', 'impaired_driving', 'maneuver_violation', 'traffic_control_violation', 'license_violation', 
               'transport_violation', 'crossing_violation', 'interference_violation', 'sudden_appearance_violation', 'other_violation', 
               'TYPE_cat', 'exp_avg', 'guilty_share', 'n_fatal_violations', 'n_INJURED', 'cause_factors_cat', 'n_guilty']]
y = df_balanced['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\n5. Full model (undersampling):\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.988753
         Iterations: 122
         Function evaluations: 126
         Gradient evaluations: 126

5. Full model (undersampling):

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:                -58416.
Model:                   OrderedModel   AIC:                         1.169e+05
Method:            Maximum Likelihood   BIC:                         1.172e+05
Date:                Sat, 03 May 2025                                         
Time:                        05:40:27                                         
No. Observations:               59080                                         
Df Residuals:                   59047                                         
Df Model:                          31                                         
                                  coef    std err          z      P>|z|      [0.0

### (2). Sample weights (via random oversampling)

In [24]:
X = df_hypoth.drop(columns=["is_toll"])
y = df_hypoth["is_toll"]

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

df_weighted = pd.concat([X_resampled, y_resampled], axis=1)
df_weighted.shape

(821174, 63)

In [25]:
X = df_weighted[['is_toll', 'road_category', 'road_rank_cat', 'site_objects_cat', 'adj_objects_cat', 
               'weather_cat', 'lighting_cat', 'road_surface_cat', 'traffic_changes_cat', 'is_NIGHT', 'is_PEAK_HOUR', 'is_WEEKEND', 'SEASON', 
               'wrong_way', 'pedestrian_violation', 'impaired_driving', 'maneuver_violation', 'traffic_control_violation', 'license_violation', 
               'transport_violation', 'crossing_violation', 'interference_violation', 'sudden_appearance_violation', 'other_violation', 
               'TYPE_cat', 'exp_avg', 'guilty_share', 'n_fatal_violations', 'n_INJURED', 'cause_factors_cat', 'n_guilty']]
y = df_weighted['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\n5. Full model (oversampling):\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.990017
         Iterations: 121
         Function evaluations: 125
         Gradient evaluations: 125

5. Full model (oversampling):

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:            -8.1298e+05
Model:                   OrderedModel   AIC:                         1.626e+06
Method:            Maximum Likelihood   BIC:                         1.626e+06
Date:                Sat, 03 May 2025                                         
Time:                        06:30:14                                         
No. Observations:              821174                                         
Df Residuals:                  821141                                         
Df Model:                          31                                         
                                  coef    std err          z      P>|z|      [0.0

In the models without any correction for class imbalance, `'is_toll'` has a positive and statistically significant effect on accident severity across all specifications, which supports the hypothesis. With sample weights, the effect remains stable and even strengthens, which speaks for the robustness of the result. With undersampling, the estimates are unstable: in one specification the effect becomes negative and significant, while in the others it becomes insignificant. A likely explanation is the loss of information caused by artificially shrinking the majority class (`'is_toll'` = 0). Given the stability of the result in the full-sample (unbalanced) and weighted approaches, **Hypothesis 2.1, that accidents on toll roads have more severe consequences, receives empirical support**.

### Separate models

We now estimate the same full specification separately for toll roads and for free roads. The two models may differ in the significance and direction of individual factors, which is what Hypothesis 2.2 predicts.

In [31]:
# Split by road type; .drop() returns a new frame, so nothing is modified on a slice in place
df_toll = df_hypoth[df_hypoth['is_toll'] == 1].drop(columns=['is_toll'])
df_free = df_hypoth[df_hypoth['is_toll'] == 0].drop(columns=['is_toll'])

In [32]:
X = df_toll[['road_category', 'road_rank_cat', 'site_objects_cat', 'adj_objects_cat', 
               'weather_cat', 'lighting_cat', 'road_surface_cat', 'traffic_changes_cat', 'is_NIGHT', 'is_PEAK_HOUR', 'is_WEEKEND', 'SEASON', 
               'wrong_way', 'pedestrian_violation', 'impaired_driving', 'maneuver_violation', 'traffic_control_violation', 'license_violation', 
               'transport_violation', 'crossing_violation', 'interference_violation', 'sudden_appearance_violation', 'other_violation', 
               'TYPE_cat', 'exp_avg', 'guilty_share', 'n_fatal_violations', 'n_INJURED', 'cause_factors_cat', 'n_guilty']]
y = df_toll['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\nFull model for toll roads:\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.998965
         Iterations: 144
         Function evaluations: 147
         Gradient evaluations: 147

Full model for toll roads:

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:                -29509.
Model:                   OrderedModel   AIC:                         5.908e+04
Method:            Maximum Likelihood   BIC:                         5.935e+04
Date:                Sat, 03 May 2025                                         
Time:                        13:08:59                                         
No. Observations:               29540                                         
Df Residuals:                   29508                                         
Df Model:                          30                                         
                                  coef    std err          z      P>|z|      [0.025  

In [33]:
X = df_free[['road_category', 'road_rank_cat', 'site_objects_cat', 'adj_objects_cat', 
               'weather_cat', 'lighting_cat', 'road_surface_cat', 'traffic_changes_cat', 'is_NIGHT', 'is_PEAK_HOUR', 'is_WEEKEND', 'SEASON', 
               'wrong_way', 'pedestrian_violation', 'impaired_driving', 'maneuver_violation', 'traffic_control_violation', 'license_violation', 
               'transport_violation', 'crossing_violation', 'interference_violation', 'sudden_appearance_violation', 'other_violation', 
               'TYPE_cat', 'exp_avg', 'guilty_share', 'n_fatal_violations', 'n_INJURED', 'cause_factors_cat', 'n_guilty']]
y = df_free['severity']

model = OrderedModel(y, X, distr='logit')
result = model.fit(method='bfgs')

print('\nFull model for free roads:\n')
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.979578
         Iterations: 101
         Function evaluations: 104
         Gradient evaluations: 104

Full model for free roads:

                             OrderedModel Results                             
Dep. Variable:               severity   Log-Likelihood:            -4.0220e+05
Model:                   OrderedModel   AIC:                         8.045e+05
Method:            Maximum Likelihood   BIC:                         8.048e+05
Date:                Sat, 03 May 2025                                         
Time:                        13:21:40                                         
No. Observations:              410587                                         
Df Residuals:                  410555                                         
Df Model:                          30                                         
                                  coef    std err          z      P>|z|      [0.025  
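
One way to make the comparison behind Hypothesis 2.2 concrete is to test, factor by factor, whether a coefficient differs between the two separately estimated models, using z = (b_toll − b_free) / sqrt(se_toll² + se_free²). The sketch below assumes that the two samples are independent and that coefficient scales are comparable across the groups, which is a strong assumption in logit-type models. The helper name and the numbers are placeholders, not values from the truncated summaries above.

```python
import math


def coef_diff_z(b1, se1, b2, se2):
    """z statistic and two-sided normal p-value for b1 - b2 (independent samples)."""
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Placeholder coefficients and standard errors for one factor.
z, p = coef_diff_z(b1=0.30, se1=0.05, b2=0.10, se2=0.02)
print(f"z = {z:.3f}, p = {p:.5f}")
```

With fitted statsmodels results, the inputs would typically be read from the `params` and `bse` attributes of the toll-road and free-road results for the same column name.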

**Conclusions.**