### **Context**

UK police forces collect data on every vehicle collision in the UK on a form called Stats19. Data from this form is passed to the DfT and published at https://data.gov.uk/dataset/road-accidents-safety-data

### **Content**

There are three CSV files in this set. Accidents is the primary table; the Casualties and Vehicles tables reference it through Accident_Index. The data might be better handled as a database.
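
As a rough illustration of the database idea, the sketch below loads three tiny stand-in tables into an in-memory SQLite database and joins two of them on Accident_Index. The rows and the reduced column sets are invented for illustration; only the table roles and the Accident_Index key come from the description above.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature versions of the three tables (all values are invented).
accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Accident_Severity": [3, 2],
})
vehicles = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2"],
    "Vehicle_Reference": [1, 2, 1],
    "Vehicle_Type": [9, 1, 9],
})
casualties = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A2"],
    "Vehicle_Reference": [2, 1, 1],
    "Casualty_Severity": [3, 2, 3],
})

# Load each frame into its own table of an in-memory database.
conn = sqlite3.connect(":memory:")
accidents.to_sql("accidents", conn, index=False)
vehicles.to_sql("vehicles", conn, index=False)
casualties.to_sql("casualties", conn, index=False)

# Count casualties per accident by joining on the shared key.
query = """
SELECT a.Accident_Index,
       a.Accident_Severity,
       COUNT(c.Casualty_Severity) AS n_casualties
FROM accidents AS a
LEFT JOIN casualties AS c ON c.Accident_Index = a.Accident_Index
GROUP BY a.Accident_Index, a.Accident_Severity
ORDER BY a.Accident_Index
"""
casualties_per_accident = pd.read_sql_query(query, conn)
conn.close()
print(casualties_per_accident)
```

With the full CSVs, a file-backed database and an index on Accident_Index would likely be needed to keep joins fast.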

### **Inspiration**

Questions to ask of this data:

* combined with population data, how do different areas compare?
* what trends are there for accidents involving different road users, e.g. motorcycles, pedestrians, cyclists?
* are road safety campaigns effective?
* how likely are accidents for different groups / vehicles?
* many more...
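
The first question could be approached roughly as follows: count accidents per area, join a population table, and compute a rate per 100,000 residents. This is only a sketch. The `area` column and the `population` table are hypothetical stand-ins, since the dataset itself does not include population figures.

```python
import pandas as pd

# Hypothetical accident records: one row per accident, keyed by an area code.
accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4"],
    "area": ["X", "X", "X", "Y"],
})
# Hypothetical population table; real figures would come from an external source.
population = pd.DataFrame({
    "area": ["X", "Y"],
    "population": [300_000, 50_000],
})

# Accidents per area, then accidents per 100,000 residents.
counts = accidents.groupby("area").size().rename("accidents").reset_index()
rates = counts.merge(population, on="area", how="left")
rates["per_100k"] = rates["accidents"] / rates["population"] * 100_000
print(rates)
```

In this toy data, area X has more accidents but area Y has the higher rate, which is why raw counts alone can mislead.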

### **Manifest**

* dft05-15.tgz - tar of Accidents0515.csv, Casualties0515.csv and Vehicles0515.csv
* tidydata.sh - script to get and tidy data.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
print("Imported")

In [None]:
sns.set_theme()

In [None]:
Accidents = pd.read_csv('../input/dft-accident-data/Accidents0515.csv',index_col='Accident_Index')
# on_bad_lines='skip' (pandas >= 1.3) silently drops malformed rows; it replaces the
# removed error_bad_lines / warn_bad_lines arguments
Casualities = pd.read_csv('../input/dft-accident-data/Casualties0515.csv',on_bad_lines='skip',index_col='Accident_Index')
Vehicles = pd.read_csv('../input/dft-accident-data/Vehicles0515.csv',on_bad_lines='skip',index_col='Accident_Index')
print('Datasets Imported')

### **ACCIDENTS**

In [None]:
Accidents.head()

In [None]:
print("Number of rows present in a dataset    :{}".format(Accidents.shape[0]))
print("Number of columns present in a dataset :{}".format(Accidents.shape[1]))

In [None]:
Accidents.info()

In [None]:
# Dates are stored day-first (DD/MM/YYYY), so parse them explicitly to avoid month/day swaps
Accidents["Date"] = pd.to_datetime(Accidents["Date"], dayfirst=True)
Accidents["Day"] = Accidents["Date"].dt.day
Accidents["Month"] = Accidents["Date"].dt.month
Accidents["Year"] = Accidents["Date"].dt.year

In [None]:
# Time is an HH:MM string; unparseable or missing values become NaT
Accidents["Time"] = pd.to_datetime(Accidents["Time"], format="%H:%M", errors="coerce")

In [None]:
Accidents.columns

In [None]:
Accidents.isnull().sum()

In [None]:
Accidents[Accidents.isna().any(axis=1)]

In [None]:
Accidents.drop("LSOA_of_Accident_Location",axis=1,inplace = True)
Accidents.dropna(inplace=True)

In [None]:
Accidents.duplicated().sum()

In [None]:
Accidents.drop_duplicates(inplace=True)

In [None]:
Accidents.info()

In [None]:
Accidents.isnull().sum()

In [None]:
Accidents.duplicated().sum()

In [None]:
Accidents.hist(bins = 50, figsize = (20,15))
plt.show()

In [None]:
print("Number of rows present in a dataset after preprocessing     :{}".format(Accidents.shape[0]))
print("Number of columns present in a dataset after preprocessing  :{}".format(Accidents.shape[1]))

In [None]:
# Restrict to numeric columns; the frame also holds datetime and string columns
Accidents.corr(numeric_only=True)

In [None]:
plt.figure(figsize = (50,50))
sns.heatmap(data = Accidents.corr(numeric_only=True),annot = True, cmap = "YlGnBu")
plt.show()

### **CASUALTIES**

In [None]:
Casualities.head()

In [None]:
print("Number of rows present in a dataset    :{}".format(Casualities.shape[0]))
print("Number of columns present in a dataset :{}".format(Casualities.shape[1]))

In [None]:
Casualities.info()

In [None]:
Casualities.isnull().sum()

In [None]:
Casualities.duplicated().sum()

In [None]:
Casualities.drop_duplicates(inplace=True)

In [None]:
Casualities.describe()

In [None]:
Casualities.hist(bins = 50, figsize = (20,15))
plt.show()

In [None]:
print("Number of rows present in a dataset after preprocessing     :{}".format(Casualities.shape[0]))
print("Number of columns present in a dataset after preprocessing  :{}".format(Casualities.shape[1]))

In [None]:
Casualities.corr()

In [None]:
plt.figure(figsize = (15,15))
sns.heatmap(data = Casualities.corr(),annot = True, cmap = "YlGnBu")
plt.show()

### **VEHICLES**

In [None]:
Vehicles.head()

In [None]:
print("Number of rows present in a dataset    :{}".format(Vehicles.shape[0]))
print("Number of columns present in a dataset :{}".format(Vehicles.shape[1]))

In [None]:
list(Vehicles.columns)

In [None]:
Vehicles.info()

In [None]:
Vehicles.isnull().sum()

In [None]:
Vehicles.duplicated().sum()

In [None]:
Vehicles.drop_duplicates(inplace = True)

In [None]:
Vehicles.describe()

In [None]:
Vehicles.hist(bins = 50, figsize = (20,15))
plt.show()

In [None]:
print("Number of rows present in a dataset after preprocessing     :{}".format(Vehicles.shape[0]))
print("Number of columns present in a dataset after preprocessing  :{}".format(Vehicles.shape[1]))

In [None]:
Vehicles.corr()

In [None]:
plt.figure(figsize = (20,15))
sns.heatmap(data = Vehicles.corr(),annot = True, cmap = "YlGnBu")
plt.show()

In [None]:
Vehicles["Vehicle_Reference"].sort_values(ascending=True)

In [None]:
Casualities["Vehicle_Reference"].sort_values(ascending = True)

In [None]:
#cas_veh_merge = pd.merge(Casualities, Vehicles, how = "outer", on = "Vehicle_Reference")
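
Vehicle_Reference only numbers the vehicles within a single accident, so merging on it alone would pair casualties with vehicles from unrelated accidents. A minimal sketch of a merge on both keys is shown below, using toy frames indexed by Accident_Index like the ones loaded above. The values are invented, and `validate="many_to_one"` assumes each (Accident_Index, Vehicle_Reference) pair is unique in the vehicles table.

```python
import pandas as pd

# Toy stand-ins, indexed by Accident_Index like the frames loaded above.
casualties = pd.DataFrame(
    {"Vehicle_Reference": [1, 2, 1], "Casualty_Class": [1, 2, 1]},
    index=pd.Index(["A1", "A1", "A2"], name="Accident_Index"),
)
vehicles = pd.DataFrame(
    {"Vehicle_Reference": [1, 2, 1], "Vehicle_Type": [9, 1, 3]},
    index=pd.Index(["A1", "A1", "A2"], name="Accident_Index"),
)

# Move Accident_Index back into a column so it can be used as a merge key,
# then match each casualty to the vehicle it was associated with.
cas_veh = pd.merge(
    casualties.reset_index(),
    vehicles.reset_index(),
    how="left",
    on=["Accident_Index", "Vehicle_Reference"],
    validate="many_to_one",  # raises if a vehicle key appears more than once
)
print(cas_veh)
```

A left join keeps every casualty row even when no matching vehicle record exists.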

### **ANALYSIS**

In [None]:
plt.figure(figsize = (20,10))
plt.subplot(1,2,1)
sns.scatterplot(data=Accidents, x = "Longitude", y = "Latitude", color = "crimson", alpha = 0.2)

plt.subplot(1,2,2)
sns.scatterplot(data=Accidents, x = "Longitude", y = "Latitude", hue = "Accident_Severity", palette = "winter")
plt.show()

In [None]:
Accidents.plot(kind = "scatter", x = "Longitude", y = "Latitude", alpha = 0.5,
             s = Accidents["Number_of_Casualties"]/100, label = "Number_of_Casualties", figsize=(15,15),
             c = "Accident_Severity", cmap = plt.get_cmap("YlGnBu"), colorbar= True
             )
plt.legend()

In [None]:
plt.figure(figsize=(15,7))
ax=sns.countplot(x=Accidents['Accident_Severity'], palette = "YlGnBu")
plt.title('ACCIDENT SEVERITY', fontsize=15)
# Severity codes: 1 = Fatal, 2 = Serious, 3 = Slight
ax.set_xticklabels(['Fatal','Serious','Slight'])
plt.grid(alpha=0.4)

In [None]:
plt.figure(figsize=(15,7))
ax = sns.countplot(x='Road_Type',hue='Accident_Severity',data=Accidents,
                   order = Accidents["Road_Type"].value_counts().index, palette = "YlGnBu")
ax.set_xticklabels(['Single carriageway','Dual_carriageway','Roundabout',
                    'One_way_street ',
                    'Slip road','Unknown'])
plt.legend(['Fatal','Serious','Slight'])
plt.ylabel("Frequency", fontsize = 14)
plt.show()
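
The tick labels above are hard-coded in the order the bars are expected to appear, which breaks silently if the ordering changes. One possible alternative, sketched here on a toy series, is to map the numeric codes to names before counting. The severity mapping follows the legend used above (1 = Fatal, 2 = Serious, 3 = Slight), and the sample values are invented.

```python
import pandas as pd

# Code-to-label mapping; the labels match the legend used in the plot above.
SEVERITY_LABELS = {1: "Fatal", 2: "Serious", 3: "Slight"}

# Toy severity column standing in for Accidents["Accident_Severity"].
severity = pd.Series([3, 3, 2, 3, 1, 2], name="Accident_Severity")

# Replace each code with its label, then count occurrences per label.
labelled = severity.map(SEVERITY_LABELS)
counts = labelled.value_counts()
print(counts)
```

Plotting the labelled column directly means the axis text always matches the data, whatever order the categories are drawn in.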

In [None]:
plt.figure(figsize=(15,7))
ax=sns.countplot(x='Light_Conditions',data=Accidents,
                 order = Accidents["Light_Conditions"].value_counts().index,
                 palette = "YlGnBu") 
ax.set_xticklabels(['Daylight','Darkness - lights lit',
                    'Darkness - no lighting',
                    'Darkness - lighting unknown',
                    'Darkness - lights unlit'])
plt.title('ACCIDENT RATES BASED ON LIGHT CONDITIONS',fontsize=15)
plt.ylabel("Frequency", fontsize = 14)
plt.show()

In [None]:
plt.figure(figsize=(50,15))
sns.countplot(x='Age_of_Casualty',data=Casualities, palette = "YlGnBu")
plt.title('CASUALTY DISTRIBUTION BASED ON AGE', fontsize=15)
plt.ylabel("frequency", fontsize = 12)
plt.show()

In [None]:
# Distribution of casualties by age, ordered from most to least frequent
plt.figure(figsize=(50,15))
sns.countplot(x = 'Age_of_Casualty',data=Casualities, 
              order = Casualities["Age_of_Casualty"].value_counts().index, 
              palette = "YlGnBu")
plt.title('CASUALTY DISTRIBUTION BASED ON AGE', fontsize=15)
plt.ylabel("frequency", fontsize = 12)
plt.show()

In [None]:
plt.figure(figsize=(25,40))
sns.countplot(y = 'Age_of_Casualty', hue = "Sex_of_Casualty",
              data=Casualities, 
              order = Casualities["Age_of_Casualty"].value_counts().index, 
              palette = "YlGnBu")
plt.title('CASUALTY DISTRIBUTION BASED ON AGE AND SEX', fontsize=15)
plt.legend(['other','Male','Female'],prop={'size': 30}, loc=0)
plt.xlabel("frequency", fontsize = 12)
plt.show()

In [None]:
plt.figure(figsize = (15,7))
ax=sns.countplot(x='Casualty_Class', data=Casualities, palette = "YlGnBu",order = Casualities["Casualty_Class"].value_counts().index)
ax.set_xticklabels(['Passenger','Driver_or_Rider','Pedestrian'])
plt.show()

In [None]:
plt.figure(figsize=(15,7))
ax=sns.countplot(x=Casualities['Sex_of_Casualty'], palette = "YlGnBu")
plt.title('CASUALTY DISTRIBUTION BASED ON SEX', fontsize=15)
ax.set_xticklabels(['other','Male','Female'])
plt.grid(alpha=0.4)

### **GEO-ANALYSIS USING GMAPS**

In [None]:
import gmaps

gmaps.configure(api_key = "Use your API key here")

In [None]:
# gmaps expects (latitude, longitude) pairs, in that order
locations = Accidents[["Latitude","Longitude"]]
weights_1 = Accidents["Accident_Severity"]
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations, weights = weights_1))
fig

Note: Try the above gmaps code in an external notebook, since Kaggle notebooks don't support it.

### **GEO-ANALYSIS USING FOLIUM**

In [None]:
import folium
from folium.plugins import MarkerCluster

# Work on a separate frame so the full Accidents table is not overwritten
geo = Accidents[["Latitude","Longitude","Accident_Severity"]].dropna()
# folium expects [latitude, longitude] pairs
locationlist = geo[["Latitude","Longitude"]].values.tolist()
severity = geo["Accident_Severity"].values

m = folium.Map(location=[51.5085300,-0.1257400], tiles='openstreetmap', zoom_start=15)
marker_cluster = MarkerCluster().add_to(m)
# Note: looping over every row is slow on the full dataset
for i in range(0,len(locationlist)):
    # Use the severity of row i (not row 0) and add each marker to the cluster
    folium.CircleMarker(locationlist[i],radius = float(severity[i]),
                        popup="Accident Severity : %s"%severity[i],color="red",fill_color='red').add_to(marker_cluster)
m

Note: Try the above folium code in an external notebook, since Kaggle notebooks don't support it.

From the analyses above, the largest share of casualties in the covered region falls in the 17-21 age group, and most of them are male. Most accidents occurred in daylight on single carriageways. Most casualties are drivers or riders, with males outnumbering females.
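
These observations are read off the plots; a rough way to put numbers on them is to compute shares directly. The sketch below does this on a toy frame. The rows and the age bands are hypothetical, and sex code 1 is taken to mean male, as in the plot legends above.

```python
import pandas as pd

# Toy casualty rows; in the notebook this would be the Casualities frame.
casualties = pd.DataFrame({
    "Age_of_Casualty": [18, 19, 20, 35, 52, 17, 64, 21],
    "Sex_of_Casualty": [1, 1, 2, 1, 2, 1, 2, 1],
})

# Hypothetical age bands chosen for illustration; ages outside the bins become NaN.
bins = [0, 16, 21, 40, 60, 120]
labels = ["0-16", "17-21", "22-40", "41-60", "61+"]
casualties["age_band"] = pd.cut(
    casualties["Age_of_Casualty"], bins=bins, labels=labels, include_lowest=True
)

# Share of casualties per age band, and share of casualties coded as male.
age_share = casualties["age_band"].value_counts(normalize=True)
male_share = (casualties["Sex_of_Casualty"] == 1).mean()
print(age_share)
print(male_share)
```

The same pattern, `value_counts(normalize=True)`, would apply to Light_Conditions, Road_Type and Casualty_Class.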

#### ***Thank you for your time...!!***

#### ***Wear Mask and Stay Safe!***