# Additional Hypothesis Testing

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\UK-Road-Accident-Analysis\\jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\UK-Road-Accident-Analysis'

In [4]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import plotly.express as px

In [5]:
df = pd.read_csv(os.path.join(current_dir, 'filtered_accident_data_set.csv'))
df.head()

Unnamed: 0,Index,Accident_Severity,Accident Date,Latitude,Light_Conditions,District Area,Longitude,Number_of_Casualties,Number_of_Vehicles,Road_Surface_Conditions,Road_Type,Urban_or_Rural_Area,Vehicle_Type
0,200720D003001,Slight,02-01-2019,52.513668,Darkness - lights lit,Birmingham,-1.901975,1,2,Wet or damp,Dual carriageway,Urban,Car
1,200720D003101,Slight,02-01-2019,52.502396,Daylight,Birmingham,-1.867086,1,2,Wet or damp,Single carriageway,Urban,Car
2,200720D003802,Serious,03-01-2019,52.563201,Daylight,Birmingham,-1.822793,1,1,Dry,Single carriageway,Urban,Car
3,200720D005801,Slight,02-01-2019,52.493431,Daylight,Birmingham,-1.818507,1,2,Wet or damp,Dual carriageway,Urban,Car
4,200720D005901,Slight,05-01-2019,52.510805,Darkness - lights lit,Birmingham,-1.834202,1,3,Dry,Dual carriageway,Urban,Car


---

## Handling Mixed-Effects Models

A linear mixed-effects model is used to analyze how month, road surface conditions and road type (the fixed effects) relate to accident counts, while allowing each district to have its own baseline level (a random intercept per district). Grouping by district accounts for the fact that observations from the same area are not independent, giving a clearer picture of what influences accident counts.

In [None]:
import statsmodels.formula.api as smf

# Derive 'Month' from the accident date (dates are stored as day-month-year)
df['Month'] = pd.to_datetime(df['Accident Date'], format="%d-%m-%Y").dt.month

# Group by relevant columns to calculate the number of accidents
df_grouped = df.groupby(['District Area', 'Month', 'Road_Surface_Conditions', 'Road_Type']).size().reset_index(name='Accidents')

# Mixed-effects model: Month (numeric), road surface and road type as fixed effects,
# with a random intercept for each district
model = smf.mixedlm("Accidents ~ Month + Road_Surface_Conditions + Road_Type",
                    df_grouped, groups=df_grouped["District Area"])

# Fit the model
result = model.fit()

# Print the model summary
print(result.summary())

                             Mixed Linear Model Regression Results
Model:                          MixedLM              Dependent Variable:              Accidents 
No. Observations:               1048                 Method:                          REML      
No. Groups:                     8                    Scale:                           2622.7260 
Min. group size:                118                  Log-Likelihood:                  -5597.2902
Max. group size:                167                  Converged:                       Yes       
Mean group size:                131.0                                                           
------------------------------------------------------------------------------------------------
                                                 Coef.   Std.Err.    z    P>|z|  [0.025   0.975]
------------------------------------------------------------------------------------------------
Intercept                                         51.555   1
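
To see how the district-level variation mentioned above can be inspected, here is a minimal sketch on synthetic data. The `toy` DataFrame, its `District` column and the simulated effect sizes are assumptions made purely for illustration and are not taken from the accident data set; `cov_re` and `random_effects` are attributes of the fitted statsmodels result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (illustrative only): 8 districts, each with its own baseline shift
rng = np.random.default_rng(42)
n_groups, n_per_group = 8, 40
group = np.repeat(np.arange(n_groups), n_per_group)
group_shift = rng.normal(0, 5, size=n_groups)
month = rng.integers(1, 13, size=group.size)
accidents = 50 + 0.5 * month + group_shift[group] + rng.normal(0, 3, size=group.size)

toy = pd.DataFrame({"Accidents": accidents, "Month": month, "District": group.astype(str)})

# Random-intercept model, fitted in the same way as the model above
toy_result = smf.mixedlm("Accidents ~ Month", toy, groups=toy["District"]).fit()

# Estimated variance of the district intercepts
print(toy_result.cov_re)

# Estimated deviation of each district from the overall intercept
intercepts = pd.Series({name: effects.iloc[0] for name, effects in toy_result.random_effects.items()})
print(intercepts)
```

Applied to the notebook's own `result`, the same two attributes would show how much of the variation in accident counts is attributed to differences between districts.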

Next, we check for collinearity between the predictor variables using the variance inflation factor (VIF).

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Dummy encode categorical variables
X = pd.get_dummies(df_grouped[['Month', 'Road_Surface_Conditions', 'Road_Type']], drop_first=True)

# Convert all boolean columns to integers
X = X.astype(int)

# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

                                       Variable       VIF
0                                         Month  2.685577
1  Road_Surface_Conditions_Flood over 3cm. deep  1.064821
2          Road_Surface_Conditions_Frost or ice  1.204965
3                  Road_Surface_Conditions_Snow  1.118458
4           Road_Surface_Conditions_Wet or damp  1.620760
5                      Road_Type_One way street  1.258437
6                          Road_Type_Roundabout  1.443385
7                  Road_Type_Single carriageway  1.606593
8                           Road_Type_Slip road  1.275193


All VIF values are below 5, a commonly used threshold, so there is no sign of problematic multicollinearity among the predictors and we can include them all in the regression model.

### Interpretation

The analysis revealed an unexpected finding: while the model showed a statistically significant effect of road surface conditions on accident frequency, the results suggest that wet or icy conditions do not lead to a substantial increase in accidents. This counterintuitive result may be attributed to several factors:

* Firstly, there was a data imbalance, with relatively few accidents occurring under icy conditions, which may limit the model's power to detect a significant effect. 
* Secondly, it is plausible that confounding factors, such as more cautious driving behavior in adverse conditions, may mitigate the increased risk associated with those conditions. Drivers may reduce their speed and increase their following distance, thus reducing the likelihood of accidents. 

Let's check whether road surface conditions and road type are associated with each other, using the chi-squared test of independence.

In [33]:
from scipy.stats import chi2_contingency

# Create a contingency table for Road_Surface_Conditions and Road_Type
contingency_table = pd.crosstab(df['Road_Surface_Conditions'], df['Road_Type'])

# Perform the Chi-Squared Test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Print the results
print("Chi-Squared Statistic:", chi2)
print("P-Value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:")
print(expected)

# Interpret the p-value
if p < 0.05:
    print("There is a significant association between Road_Surface_Conditions and Road_Type.")
else:
    print("There is no significant association between Road_Surface_Conditions and Road_Type.")

Chi-Squared Statistic: 22.454283308155567
P-Value: 0.12911959117995064
Degrees of Freedom: 16
Expected Frequencies:
[[4.34414927e+03 3.96106369e+02 1.58579609e+03 1.50088678e+04
  2.48080460e+02]
 [4.62935797e+00 4.22112148e-01 1.68990919e+00 1.59942529e+01
  2.64367816e-01]
 [1.46931797e+02 1.33974725e+01 5.36362482e+01 5.07643678e+02
  8.39080460e+00]
 [4.95140027e+01 4.51476472e+00 1.80746809e+01 1.71068966e+02
  2.82758621e+00]
 [1.79377558e+03 1.63559281e+02 6.54803074e+02 6.19742529e+03
  1.02436782e+02]]
There is no significant association between Road_Surface_Conditions and Road_Type.


Since the test finds no significant association between the two variables (p ≈ 0.13), we can stick with our evaluation above.
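
The chi-squared approximation relies on reasonably large expected frequencies. In the output above, six of the 25 expected frequencies are below 5 and two are below 1, so the p-value should be read with some caution. A minimal sketch of how such a check could be automated follows; the helper name `expected_count_summary`, the threshold of 5 and the `toy_table` values are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency


def expected_count_summary(table, threshold=5.0):
    """Summarize how many expected cell counts fall below a threshold (hypothetical helper)."""
    chi2, p, dof, expected = chi2_contingency(table)
    return {
        "p_value": float(p),
        "dof": int(dof),
        "min_expected": float(expected.min()),
        "share_below_threshold": float((expected < threshold).mean()),
    }


# Small synthetic table (illustrative only): the second row is a rare category
toy_table = np.array([[50, 30],
                      [2, 1]])
summary = expected_count_summary(toy_table)
print(summary)
```

For the notebook's data the call would be `expected_count_summary(contingency_table)`. If many cells turn out to be sparse, one option is to merge rare road surface categories before re-running the test.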