# Non-Graphical Exploratory Data Analysis Notebook

## Objectives

* Conduct statistical testing to explore and understand the dataset's distribution and central tendency, identify anomalies, and get a general overview of the data's patterns.

## Inputs

* Dataset: `filtered_accident_data_set.csv`

## Outputs

* Summary statistics
* Frequency counts
* Statistical tests



# Change working directory

* We assume the notebooks are stored in a subfolder, so when running the notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/UK-Road-Accident-Analysis/UK-Road-Accident-Analysis/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/mahahussain/Desktop/UK-Road-Accident-Analysis/UK-Road-Accident-Analysis'
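Re-running the directory cells moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path shown above; `project_root` is a hypothetical helper, not part of the original notebook:

```python
import os
from pathlib import Path


def project_root(start: Path, subfolder: str = "jupyter_notebooks") -> Path:
    """Return the parent of `start` only when `start` is the notebooks subfolder."""
    return start.parent if start.name == subfolder else start


# Safe to run repeatedly: once we are at the project root, nothing changes.
root = project_root(Path.cwd())
os.chdir(root)
print(f"Working directory: {root}")
```

Because the function only steps up when the current folder has the expected name, running the cell a second time leaves the working directory unchanged.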

# Section 1: Understanding the data

In [4]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Before conducting statistical and graphical analysis, we must first inspect the dataset itself. 

In [5]:
# Load the data 
df = pd.read_csv("filtered_accident_data_set.csv") 

# Display basic structure 
print("Dataset Shape:", df.shape)  # Provides the number of rows and columns in the dataset
print("First few rows:\n", df.head()) # Displays the first few rows of the dataset

Dataset Shape: (32657, 13)
First few rows:
            Index Accident_Severity Accident Date   Latitude  \
0  200720D003001            Slight    02-01-2019  52.513668   
1  200720D003101            Slight    02-01-2019  52.502396   
2  200720D003802           Serious    03-01-2019  52.563201   
3  200720D005801            Slight    02-01-2019  52.493431   
4  200720D005901            Slight    05-01-2019  52.510805   

        Light_Conditions District Area  Longitude  Number_of_Casualties  \
0  Darkness - lights lit    Birmingham  -1.901975                     1   
1               Daylight    Birmingham  -1.867086                     1   
2               Daylight    Birmingham  -1.822793                     1   
3               Daylight    Birmingham  -1.818507                     1   
4  Darkness - lights lit    Birmingham  -1.834202                     1   

   Number_of_Vehicles Road_Surface_Conditions           Road_Type  \
0                   2             Wet or damp    Dual car

Our cleaned and filtered dataset contains 32,657 rows and 13 columns. The `head()` function shows the first few rows. However, the shape and the first rows alone do not tell us about the completeness of the dataset or whether the data types have been correctly set. Therefore, we will use the Pandas `info()` function to get a better look at the structure of the dataset.

In [6]:
# summary of datatypes, non-null's and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32657 entries, 0 to 32656
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Index                    32657 non-null  object 
 1   Accident_Severity        32657 non-null  object 
 2   Accident Date            32657 non-null  object 
 3   Latitude                 32657 non-null  float64
 4   Light_Conditions         32657 non-null  object 
 5   District Area            32657 non-null  object 
 6   Longitude                32657 non-null  float64
 7   Number_of_Casualties     32657 non-null  int64  
 8   Number_of_Vehicles       32657 non-null  int64  
 9   Road_Surface_Conditions  32657 non-null  object 
 10  Road_Type                32657 non-null  object 
 11  Urban_or_Rural_Area      32657 non-null  object 
 12  Vehicle_Type             32657 non-null  object 
dtypes: float64(2), int64(2), object(9)
memory usage: 3.2+ MB


From this output, the following is evident:
- Most columns (9 out of 13) are stored with the `object` data type, which for all but Accident Date indicates categorical or text values.
- None of the columns contain null values, which makes the dataset ready for analysis without imputation.
- The Accident Date column is incorrectly set as an `object` type. This will need to be converted to a datetime format.

In [7]:
# Convert the 'Accident Date' column to datetime format
df['Accident Date'] = pd.to_datetime(df['Accident Date'], dayfirst=True)

# dayfirst=True explicitly specifies the UK date format (day/month/year). Default is month/day/year.

df['Accident Date'].dtype

dtype('<M8[ns]')

Here we can see the column is now set to a datetime format.

To verify that the recorded date is correct and in the UK format, we can extract the constituent parts.

In [8]:
date_to_check = df['Accident Date'].iloc[0] # Get the first date in the column
print(date_to_check) # Display the date in the column

# Extract the day, month and year from the date to verify the conversion in UK date format
df['Accident Date'].iloc[0].strftime('%d %B %Y')


2019-01-02 00:00:00


'02 January 2019'
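As an alternative to `dayfirst=True`, an explicit format string removes any guessing about the date layout. The sketch below uses a few toy strings in the same `DD-MM-YYYY` layout seen in the `head()` output; the series names `raw`, `parsed` and `checked` are illustrative only:

```python
import pandas as pd

# Toy dates in the DD-MM-YYYY layout used by the 'Accident Date' column.
raw = pd.Series(["02-01-2019", "03-01-2019", "13-07-2019"])

# An explicit format leaves no ambiguity between day-first and month-first.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

# With errors="coerce", values that do not match the format become NaT,
# so malformed dates can be counted instead of being silently misread.
mixed = pd.Series(["02-01-2019", "2019/01/02"])
checked = pd.to_datetime(mixed, format="%d-%m-%Y", errors="coerce")

print(parsed.dt.day.tolist(), parsed.dt.month.tolist())
print("Unparseable dates:", checked.isna().sum())
```

Counting `NaT` values after a coerced parse is a quick way to confirm that every date in a column follows the expected layout.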

## Univariate

Univariate analysis is focused on inspecting one given variable at a time. Its main purpose is to serve as a summary of the data column's central tendency and the level of variation within it.

### Summary Statistics (numerical)
- describe(): gives an overview of the dataset's key statistics, helping to understand its distribution and characteristics.

In [9]:
df.describe()

| Statistic | Accident Date | Latitude | Longitude | Number_of_Casualties | Number_of_Vehicles |
|---|---|---|---|---|---|
| count | 32657 | 32657.0 | 32657.0 | 32657.0 | 32657.0 |
| mean | 2020-11-14 14:26:46.148758528 | 52.483689 | -1.868548 | 1.373304 | 1.848945 |
| min | 2019-01-01 00:00:00 | 52.223378 | -2.198914 | 1.0 | 1.0 |
| 25% | 2019-11-15 00:00:00 | 52.443771 | -2.003216 | 1.0 | 1.0 |
| 50% | 2020-10-22 00:00:00 | 52.482736 | -1.897252 | 1.0 | 2.0 |
| 75% | 2021-11-05 00:00:00 | 52.526943 | -1.800637 | 1.0 | 2.0 |
| max | 2022-12-31 00:00:00 | 52.662124 | -1.410728 | 24.0 | 10.0 |
| std | | 0.072148 | 0.19371 | 0.81716 | 0.671611 |


- **Count**: As expected, the counts remain identical, indicating that there are no missing values within the dataset.
  
- **Mean**: 
    - **Number of Casualties**: The mean casualty count is **1.37**, suggesting that each accident results in just over 1 casualty on average.
    - **Number of Vehicles**: The mean vehicle count is **1.85**, implying that each accident involves around 1-2 vehicles.

- **Min**:
    - **Number of Casualties**: The minimum number of casualties in any accident is **1**.
    - **Number of Vehicles**: The minimum number of vehicles involved in an accident is **1**.

- **Max**:
    - **Number of Casualties**: The maximum number of casualties in a single accident is **24**.
    - **Number of Vehicles**: The maximum number of vehicles involved in an accident is **10**.

- **Standard Deviation**:
    - **Number of Casualties**: The standard deviation is **0.82**, indicating that while most accidents involve 1 or 2 casualties, there are a few with significantly more.
    - **Number of Vehicles**: The standard deviation is **0.67**, suggesting that most accidents involve around 1-2 vehicles, with a few accidents involving more.

- **Percentiles**:
    - **25% (Q1)**: At least a quarter of accidents involve **1 casualty** and **1 vehicle** (25% of values fall at or below these figures).
    - **50% (Median)**: The median accident involves **1 casualty** and **2 vehicles**.
    - **75% (Q3)**: 75% of accidents involve at most **1 casualty** and at most **2 vehicles**.
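Quartiles state the value at or below which a given share of observations fall; they do not say what share of accidents takes a particular value. The sketch below illustrates the difference on a small toy series (hypothetical values, not taken from the dataset):

```python
import pandas as pd

# Toy casualty counts (hypothetical values, for illustration only).
casualties = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 24], name="Number_of_Casualties")

# Quartiles: values at or below which 25%, 50% and 75% of observations fall.
quartiles = casualties.quantile([0.25, 0.5, 0.75])

# Share of accidents with exactly one casualty, which quartiles do not reveal.
share_single = (casualties == 1).mean()

print(quartiles)
print(f"Share with one casualty: {share_single:.0%}")
```

Applied to the real `Number_of_Casualties` column, the same comparison would show how much of the dataset sits at the minimum value of 1.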

### Measures of Central Tendency: Median & Mode


As the mean was covered in the describe() function, we will calculate the mode and median.

- Mode: to identify the most frequent value in the dataset columns.
- Median: to provide a 'typical' value of a given column in the dataset that is resistant to outliers.

The **mode** (most frequent values) in the dataset provides insights into the most common characteristics of accidents.

In [10]:
# mode of the columns
df.mode()

| | Index | Accident_Severity | Accident Date | Latitude | Light_Conditions | District Area | Longitude | Number_of_Casualties | Number_of_Vehicles | Road_Surface_Conditions | Road_Type | Urban_or_Rural_Area | Vehicle_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010000000000.0 | Slight | 2019-07-13 | 52.458798 | Daylight | Birmingham | -1.871043 | 1 | 2 | Dry | Single carriageway | Urban | Car |


- **Accident Severity**: Slight, indicating that most accidents are categorised as slight.
- **Accident Date**: 2019-07-13
- **Latitude**: 52.458798
- **Light Conditions**: Daylight
- **District Area**: Birmingham
- **Longitude**: -1.871043
- **Number of Casualties**: 1
- **Number of Vehicles**: 2
- **Road Surface Conditions**: Dry
- **Road Type**: Single carriageway
- **Urban or Rural Area**: Urban
- **Vehicle Type**: Car

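Because `mode()` is computed independently for each column, this row lists the most frequent value of each attribute rather than the most frequent combination of attributes. It still gives an overview of the typical accident characteristics in the dataset.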
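Per-column modes and the most frequent combination of values are not necessarily the same thing. The following sketch contrasts the two on a toy frame; the rows are hypothetical and only the column names and category labels are borrowed from the dataset:

```python
import pandas as pd

# Toy data (hypothetical rows).
toy = pd.DataFrame({
    "Road_Type": ["Single carriageway"] * 4 + ["Dual carriageway"] * 2 + ["Slip road"],
    "Light_Conditions": ["Darkness - lights lit"] * 3 + ["Daylight"] * 4,
})

# Most frequent value of each column, computed independently.
per_column_mode = toy.mode().iloc[0]

# Most frequent combination of values across the columns (one count per unique row).
combo_counts = toy.value_counts()
top_combo = combo_counts.idxmax()

print("Per-column modes:", tuple(per_column_mode))
print("Most frequent combination:", top_combo, "with", combo_counts.max(), "rows")
```

In this toy example the per-column modes are Single carriageway and Daylight, yet the most common row pairs Single carriageway with Darkness - lights lit.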

In [11]:
# median for numerical values only
df.select_dtypes(include='number').median()

Latitude                52.482736
Longitude               -1.897252
Number_of_Casualties     1.000000
Number_of_Vehicles       2.000000
dtype: float64

The **median** values for the numerical columns in the dataset represent the central point (50th percentile) of the data. Below are the median values for the relevant numerical attributes:

- **Latitude**: 52.482736 (central latitude of accident locations)
- **Longitude**: -1.897252 (central longitude of accident locations)
- **Number of Casualties**: 1 (median number of casualties in accidents)
- **Number of Vehicles**: 2 (median number of vehicles involved in accidents)

These values indicate typical or central tendencies for each of the numerical columns in the dataset.

### Skewness, Kurtosis, and Shapiro-Wilk Test


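We do not interpret **Longitude** and **Latitude** here because they represent geographic positions, not quantities that follow a distribution we would typically assess using these statistical tests. They still appear in the skewness and kurtosis outputs below because `select_dtypes(include='number')` selects every numeric column.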

#### Skewness


Skewness values greater than 0 imply a right-skewed distribution, while negative values indicate a left-skewed distribution.

In [12]:
# test skewness of numerical columns
df.select_dtypes(include='number').skew()

Latitude               -0.477152
Longitude               0.730091
Number_of_Casualties    4.022197
Number_of_Vehicles      1.276229
dtype: float64

Both Number of Casualties and Number of Vehicles show positive skewness, suggesting that most accidents tend to involve fewer vehicles and casualties, with a few outliers having more.
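To keep the coordinates out of the output altogether, the calculation can be restricted to the count columns. A minimal sketch on toy data (hypothetical values); the list name `count_cols` is an assumption introduced for illustration:

```python
import pandas as pd

# Toy data (hypothetical values) with one coordinate column and two count columns.
toy = pd.DataFrame({
    "Latitude": [52.51, 52.50, 52.56, 52.49, 52.51, 52.48, 52.47, 52.52, 52.53, 52.50],
    "Number_of_Casualties": [1, 1, 1, 1, 1, 1, 2, 2, 3, 24],
    "Number_of_Vehicles": [1, 2, 2, 2, 1, 2, 2, 3, 2, 10],
})

# Only the columns that represent counts, not geographic positions.
count_cols = ["Number_of_Casualties", "Number_of_Vehicles"]

# Skewness per selected column; positive values indicate a longer right tail.
skewness = toy[count_cols].skew()
print(skewness)
```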

#### Kurtosis


Kurtosis measures the "tailedness" of a distribution. A high kurtosis indicates that the data has heavy tails or outliers, while low kurtosis suggests lighter tails and fewer outliers.

In [13]:
# test kurtosis of numerical columns
df.select_dtypes(include='number').kurt()

Latitude                 0.789487
Longitude               -0.319165
Number_of_Casualties    39.557069
Number_of_Vehicles       6.322462
dtype: float64

Pandas' `kurt()` reports excess kurtosis, for which a normal distribution scores 0. Number of Casualties shows very high excess kurtosis (39.56), meaning there are some accidents with significantly more casualties than the majority.
Number of Vehicles also has elevated excess kurtosis (6.32), suggesting that most accidents involve fewer vehicles, but there are some accidents with notably more vehicles involved.
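As a reference point for reading these values, the sketch below compares `kurt()` on two simulated samples. The samples and the seed are arbitrary choices for illustration and are unrelated to the accident data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A normal sample: its excess kurtosis should be close to 0.
normal_sample = pd.Series(rng.normal(size=100_000))

# An exponential sample: right-skewed and heavy-tailed (theoretical excess kurtosis of 6).
heavy_tailed_sample = pd.Series(rng.exponential(size=100_000))

normal_kurt = normal_sample.kurt()
heavy_kurt = heavy_tailed_sample.kurt()

print(f"Normal sample:      {normal_kurt:.3f}")
print(f"Exponential sample: {heavy_kurt:.3f}")
```

Values far above 0, like those of the two count columns, therefore point to tails much heavier than a normal distribution's.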

#### Shapiro-Wilk


In [14]:
# Shapiro-Wilk test for normality
from scipy import stats

shapiro = stats.shapiro(df.select_dtypes(include='number'))
shapiro

  res = hypotest_fun_out(*samples, **kwds)


ShapiroResult(statistic=0.5950634312505818, pvalue=6.910672547903147e-161)

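The Shapiro-Wilk test result suggests that the data is not normally distributed:

- **Statistic: 0.5951:** A value well below 1 indicates a substantial deviation from normality.
- **p-value:** The extremely small p-value suggests strong evidence against the null hypothesis, which states that the data is normally distributed. Since the p-value is far below the common significance level of 0.05, we reject the null hypothesis and conclude that the data does not follow a normal distribution.

Two caveats apply. The call passes the whole numeric DataFrame, which SciPy treats as a single pooled sample, so the result describes Latitude, Longitude, Number of Casualties and Number of Vehicles combined rather than any one column. In addition, SciPy notes that for samples larger than 5,000 the p-value may not be accurate. The strong skewness and kurtosis of the two count columns reported above point in the same direction.

This means that, moving forward, we will need to use non-parametric methods for statistical testing.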
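Shapiro-Wilk is usually applied to one variable at a time, and SciPy notes that p-values may not be accurate for samples above 5,000 observations. The sketch below tests each count column separately on a random subsample of 5,000 rows. The data is simulated (shifted Poisson counts) purely so the example runs on its own, and the subsample size and seeds are assumptions:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated stand-in for the two count columns (hypothetical data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Number_of_Casualties": rng.poisson(0.4, size=20_000) + 1,
    "Number_of_Vehicles": rng.poisson(0.8, size=20_000) + 1,
})

results = {}
for column in toy.columns:
    # Subsample to stay within the size for which SciPy's p-value is reliable.
    sample = toy[column].sample(n=5000, random_state=0)
    statistic, p_value = stats.shapiro(sample)
    results[column] = (statistic, p_value)

for column, (statistic, p_value) in results.items():
    print(f"{column}: W={statistic:.4f}, p={p_value:.3g}")
```

With the real data, replacing `toy` by `df[["Number_of_Casualties", "Number_of_Vehicles"]]` would give one result per column instead of a single pooled statistic.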

### Frequency Distribution


Frequency distribution shows how often each categorical value occurs in the dataset. It will help to summarise the diversity and structure of the dataset.

In [15]:
# Check cardinality of categorical columns
df.select_dtypes(include='object').nunique()

Index                      29902
Accident_Severity              3
Light_Conditions               5
District Area                  9
Road_Surface_Conditions        5
Road_Type                      5
Urban_or_Rural_Area            2
Vehicle_Type                  14
dtype: int64

1. Accident Severity presents three distinct categories, indicating varying levels of impact.

2. Light Conditions are classified into five categories, reflecting different levels of visibility.

3. Geographical dispersion is represented by nine District Areas.

4. Road Surface Conditions are categorised into five classifications, reflecting various surface states.

5. Road Type is delineated by five categories, signifying different road infrastructures.

6. Urban or Rural Area distinguishes between two distinct environmental contexts.

7. Vehicle Type encompasses 14 categories, characterising the diverse range of vehicles involved.

In [16]:
# Create a frequency table for all the categorical columns
for column in df.select_dtypes(include='object'):
    print(df[column].value_counts())
    print("\n")

Index
2.01E+12         2756
200920G040001       1
200920G041102       1
200920G041002       1
200920G040902       1
                 ... 
200820E018803       1
200820E018401       1
200820E018301       1
200820E018201       1
201020Z720511       1
Name: count, Length: 29902, dtype: int64


Accident_Severity
Slight     28566
Serious     3818
Fatal        273
Name: count, dtype: int64


Light_Conditions
Daylight                       23702
Darkness - lights lit           8310
Darkness - no lighting           348
Darkness - lighting unknown      189
Darkness - lights unlit          108
Name: count, dtype: int64


District Area
Birmingham               13484
Sandwell                  3556
Coventry                  2990
Dudley                    2757
Walsall                   2696
Wolverhampton             2463
Solihull                  1955
Warwick                   1593
Nuneaton and Bedworth     1163
Name: count, dtype: int64


Road_Surface_Conditions
Dry                     22386
Wet or 

- Accident Severity: Most accidents are slight (28,566), with fewer serious (3,818) and fatal (273) cases.

- Light Conditions: The majority occur in daylight (23,702), followed by darkness with lights lit (8,310).

- District Area: Birmingham has the highest number of accidents (13,484), followed by Sandwell (3,556) and Coventry (2,990).

- Road Surface Conditions: Most accidents happen on dry roads (22,386), with wet/damp conditions contributing to 8,912 accidents.

- Road Type: Single carriageways see the most accidents (21,901), while slip roads have the fewest (362).

- Urban vs Rural: Accidents are predominantly in urban areas (29,584) compared to rural (1,910).

- Vehicle Type: Cars are involved in the majority of accidents (22,487), followed by vans (1,826) and buses/coaches (1,380).
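Raw counts are often easier to compare as proportions. The sketch below converts the Accident_Severity counts printed in the frequency table above into percentages; the variable names are illustrative:

```python
import pandas as pd

# Accident_Severity counts as printed in the frequency table output.
severity_counts = pd.Series({"Slight": 28566, "Serious": 3818, "Fatal": 273}, name="count")

# Convert counts to percentages of all accidents.
severity_share = (severity_counts / severity_counts.sum() * 100).round(2)

print(severity_share)

# On the full DataFrame, df["Accident_Severity"].value_counts(normalize=True)
# returns the same proportions directly (as fractions rather than percentages).
```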

## Multivariate (Non-Graphical)

### Correlation Matrix (Numerical Variables)


In [17]:
# Compute the correlation matrix
correlation_matrix = df.select_dtypes(include='number').corr()

# Display the correlation matrix
correlation_matrix

| | Latitude | Longitude | Number_of_Casualties | Number_of_Vehicles |
|---|---|---|---|---|
| Latitude | 1.0 | -0.595083 | 0.011178 | 0.007955 |
| Longitude | -0.595083 | 1.0 | -0.025096 | 0.004277 |
| Number_of_Casualties | 0.011178 | -0.025096 | 1.0 | 0.226398 |
| Number_of_Vehicles | 0.007955 | 0.004277 | 0.226398 | 1.0 |


Number of Casualties and Number of Vehicles: A weak positive correlation (0.226) indicates that accidents with more vehicles tend, to a small degree, to involve more casualties.
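`DataFrame.corr()` computes Pearson correlation by default, which is sensitive to extreme values. Since the count columns are heavily skewed, a rank-based Spearman correlation may be a useful cross-check. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical values) with one extreme accident in the last row.
toy = pd.DataFrame({
    "Number_of_Casualties": [1, 1, 2, 1, 3, 1, 2, 24],
    "Number_of_Vehicles": [1, 2, 2, 1, 3, 2, 2, 10],
})

# Pearson works on raw values; Spearman works on ranks, so outliers weigh less.
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

pair = ("Number_of_Casualties", "Number_of_Vehicles")
print(f"Pearson:  {pearson.loc[pair]:.3f}")
print(f"Spearman: {spearman.loc[pair]:.3f}")
```

On the real data, `df.select_dtypes(include='number').corr(method='spearman')` would produce the rank-based counterpart of the matrix above.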