# Exploratory Data Analysis Notebook

## Objectives

* Conduct statistical tests to explore the dataset's distribution and central tendency, identify any anomalies, and get a general overview of the patterns in the data.

## Inputs

* `filtered_accident_data_set.csv`, the cleaned and filtered accident dataset loaded in Section 1.

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



# Change working directory

* We assume the notebooks are stored in a subfolder, so when running the notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
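
If the `os.chdir` cell above is run twice, the working directory moves up two levels instead of one. A minimal sketch of a more defensive alternative follows. `find_project_root` is a hypothetical helper (it is not part of `os` or of this notebook), and it assumes the CSV loaded later in the notebook sits in the project root.

```python
import os
from pathlib import Path


def find_project_root(marker, start=None):
    """Return the nearest folder at or above `start` that contains `marker`."""
    start = Path(start or os.getcwd()).resolve()
    # Check the starting folder first, then each parent in turn.
    for folder in (start, *start.parents):
        if (folder / marker).exists():
            return folder
    raise FileNotFoundError(f"No folder containing {marker!r} at or above {start}")


# Hypothetical usage, assuming the CSV lives in the project root:
# os.chdir(find_project_root("filtered_accident_data_set.csv"))
```

Because the search starts from wherever the notebook currently is, re-running the cell leaves the working directory unchanged.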

# Section 1: Understanding the data

In [4]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Before conducting statistical and graphical analysis, we must first inspect the dataset itself. 

In [None]:
# Load the data 
df = pd.read_csv("filtered_accident_data_set.csv") 

# Display basic structure 
print("Dataset Shape:", df.shape)  # Provides the number of rows and columns in the dataset
print("First few rows:\n", df.head()) # Displays the first few rows of the dataset

Our cleaned and filtered dataset has 31,494 rows and 13 columns. Using the `head()` function, we can see the first few rows. However, these functions alone do not tell us whether the dataset is complete or whether the data types have been set correctly. Therefore, we will use the Pandas `info()` function to get a better look at the structure of the dataset.

In [None]:
# Summary of data types, non-null counts and memory usage
df.info()

From this output, the following is evident:
- The majority of the columns (9 out of 13) are categorical, as indicated by the `object` data type.
- None of the columns contain null values, which makes the dataset ideal for analysis.
- The Accident Date column is incorrectly set as an `object` type. This will need to be converted to a datetime format.
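
The same checks can be made programmatically rather than read off the `info()` printout. The sketch below uses a small made-up frame (`toy`) in place of `df`. The column names mirror the ones discussed here, and the values are illustrative only.

```python
import pandas as pd

# Illustrative stand-in for df; the real data comes from the CSV.
toy = pd.DataFrame({
    "Accident Date": ["09/07/2020", "10/07/2020", "11/07/2020"],
    "Accident Severity": ["Slight", "Serious", "Slight"],
    "Number of Casualties": [1, 2, 1],
})

# Count missing values per column; all zeros means the data is complete.
null_counts = toy.isna().sum()

# List the non-numeric columns, which include the date still stored as text.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print(non_numeric_cols)
```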

In [None]:
# Convert the 'Accident Date' column to datetime format.
# dayfirst=True explicitly specifies the UK date format (day/month/year).
# The pandas default is month/day/year.
df['Accident Date'] = pd.to_datetime(df['Accident Date'], dayfirst=True)

df['Accident Date'].dtype

Here we can see the column is now set to a datetime format.
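
To illustrate why `dayfirst=True` matters, the sketch below parses one ambiguous date string both ways. It assumes the raw dates are written as `dd/mm/yyyy` text, which is not confirmed by the output shown here; the sample strings are made up.

```python
import pandas as pd

ambiguous = "03/04/2020"  # could mean 3 April or 4 March

month_first = pd.to_datetime(ambiguous)               # pandas default order
day_first = pd.to_datetime(ambiguous, dayfirst=True)  # UK order

print(month_first.date(), day_first.date())

# An explicit format is stricter: it fails loudly on strings that do not match.
strict = pd.to_datetime(pd.Series(["03/04/2020", "25/12/2020"]), format="%d/%m/%Y")
print(strict)
```

If the raw strings do follow one fixed pattern, passing `format=` is a reasonable alternative to `dayfirst=True`, because a stray value in a different layout raises an error instead of being parsed silently.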

To verify that the date was parsed in the UK day/month order, we can print the first date with the month written out and compare it with the raw value shown by `head()` earlier.

In [None]:
date_to_check = df['Accident Date'].iloc[0]  # Get the first date in the column
print(date_to_check)  # Display the parsed timestamp

# Write out the day, month name and year so the day/month order is unambiguous
date_to_check.strftime('%d %B %Y')


## Univariate (Non-Graphical)

Univariate analysis is focused on inspecting one given variable at a time. Its main purpose is to serve as a summary of the data column's central tendency and the level of variation within it.

### Summary Statistics (numerical)
- describe()

In [None]:
df.describe()

- **Count**: As expected, the counts remain identical, indicating that there are no missing values within the dataset.
  
- **Mean**: 
    - **Number of Casualties**: The mean casualty count is **1.38**, suggesting that each accident results in just over 1 casualty on average.
    - **Number of Vehicles**: The mean vehicle count is **1.85**, implying that each accident involves just under 2 vehicles on average.

- **Min**:
    - **Number of Casualties**: The minimum number of casualties in any accident is **1**.
    - **Number of Vehicles**: The minimum number of vehicles involved in an accident is **1**.

- **Max**:
    - **Number of Casualties**: The maximum number of casualties in a single accident is **24**.
    - **Number of Vehicles**: The maximum number of vehicles involved in an accident is **10**.

- **Standard Deviation**:
    - **Number of Casualties**: The standard deviation is **0.82**. Together with the maximum of 24, this indicates that while most accidents involve 1 or 2 casualties, a few involve significantly more.
    - **Number of Vehicles**: The standard deviation is **0.67**, suggesting that most accidents involve around 1-2 vehicles, with a few accidents involving more.

- **Percentiles**:
    - **25% (Q1)**: At least 25% of accidents involve **1 casualty**, and at least 25% involve **1 vehicle**, the minimum for both columns.
    - **50% (Median)**: The median accident involves **1 casualty** and **2 vehicles**, matching the medians calculated in the next section.
    - **75% (Q3)**: At least 75% of accidents involve **1 casualty**, and at least 75% involve **2 vehicles** or fewer.
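
As a rough illustration of how to read these percentiles, the sketch below computes the quartiles of a small made-up casualty series and the share of rows at or below each one. The values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Made-up casualty counts, dominated by single-casualty accidents.
casualties = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 3], name="Number of Casualties")

# The same quartiles that describe() reports as 25%, 50% and 75%.
quartiles = casualties.quantile([0.25, 0.50, 0.75])

# Share of rows at or below each quartile value.
share_at_or_below = {q: (casualties <= value).mean() for q, value in quartiles.items()}

print(quartiles)
print(share_at_or_below)
```

A quartile is a threshold, not a group: the share of rows at or below Q1 is at least 25%, and it can be much larger when many rows share the same value, as they do here.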

### Median & Mode


As the mean was covered in the describe() function, we will calculate the mode and median.

- Mode: to identify the most frequent value in the dataset columns.
- Median: to provide a 'typical' value of a given column in the dataset that is resistant to outliers.

The **mode** (most frequent values) in the dataset provides insights into the most common characteristics of accidents.

In [None]:
# mode of the columns
df.mode()

- **Accident Severity**: Slight
- **Accident Date**: 2020-07-09
- **Latitude**: 52.458798
- **Light Conditions**: Daylight
- **District Area**: Birmingham
- **Longitude**: -1.871043
- **Number of Casualties**: 1
- **Number of Vehicles**: 2
- **Road Surface Conditions**: Dry
- **Road Type**: Single carriageway
- **Urban or Rural Area**: Urban
- **Vehicle Type**: Car

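Each value above is the most frequent value of its own column, calculated independently of the other columns. Taken together they give an overview of the typical accident scenario, but the row is not necessarily the most frequent combination of attributes in the dataset.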
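
Note that `mode()` works column by column. A small sketch on made-up data shows how the row of column-wise modes can differ from the most common combination of values; `toy` and its values are illustrative only.

```python
import pandas as pd

# Made-up data: 3 x (Daylight, Wet), 2 x (Daylight, Dry), 2 x (Darkness, Dry)
toy = pd.DataFrame({
    "Light Conditions": ["Daylight"] * 5 + ["Darkness"] * 2,
    "Road Surface Conditions": ["Wet"] * 3 + ["Dry"] * 4,
})

# Column-wise modes, as returned by df.mode()
column_modes = toy.mode().iloc[0]

# Frequency of each full row, and the most common combination
combo_counts = toy.value_counts()
top_combo = combo_counts.idxmax()

print(tuple(column_modes))
print(top_combo, combo_counts.max())
```

Here the column-wise modes are Daylight and Dry, yet the most frequent pairing is Daylight with Wet. Restricting `value_counts()` to a few categorical columns is one way to find genuinely common scenarios.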

In [None]:
# median for numerical values only
df.select_dtypes(include='number').median()

The **median** values for the numerical columns in the dataset represent the central point (50th percentile) of the data. Below are the median values for the relevant numerical attributes:

- **Latitude**: 52.481789 (central latitude of accident locations)
- **Longitude**: -1.901860 (central longitude of accident locations)
- **Number of Casualties**: 1 (median number of casualties in accidents)
- **Number of Vehicles**: 2 (median number of vehicles involved in accidents)

These values indicate typical or central tendencies for each of the numerical columns in the dataset.
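
To make the point about outlier resistance concrete, the following sketch compares the mean and median of a made-up casualty series containing one extreme value. The numbers are illustrative, not drawn from the dataset.

```python
import pandas as pd

# Four ordinary accidents and one extreme one.
casualties = pd.Series([1, 1, 1, 2, 24])

mean_value = casualties.mean()      # pulled upwards by the single large value
median_value = casualties.median()  # unaffected by how large the outlier is

print(mean_value, median_value)
```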

### Skewness, Kurtosis, and Shapiro-Wilk Test


We are ignoring **Longitude** and **Latitude** when interpreting the results below, although they still appear in the output. They represent geographic positions, not quantities that follow a distribution we would typically assess using these statistical tests.

In [None]:
# test skewness of numerical columns
df.select_dtypes(include='number').skew()

Both Number of Casualties and Number of Vehicles show positive skewness, suggesting that most accidents tend to involve fewer vehicles and casualties, with a few outliers having more.

In [None]:
# test kurtosis of numerical columns
df.select_dtypes(include='number').kurt()

Kurtosis measures the "tailedness" of a distribution. High kurtosis indicates that the data has heavy tails or outliers, while low kurtosis suggests lighter tails and fewer outliers. Note that pandas' `kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores 0.

Number of Casualties shows high kurtosis, meaning there are some accidents with significantly more casualties than the majority.
Number of Vehicles has moderate kurtosis, suggesting that most accidents involve few vehicles, but there are some accidents with notably more vehicles involved.
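
One way to keep the coordinate columns out of the output altogether is to select the count columns explicitly. The sketch below does this on made-up data; `toy` and `count_cols` are illustrative names, and the column names are assumed to match those shown in the `describe()` output.

```python
import pandas as pd

# Made-up data with one coordinate column and two right-skewed count columns.
toy = pd.DataFrame({
    "Latitude": [52.40, 52.41, 52.42, 52.43, 52.44, 52.45, 52.46, 52.47, 52.48, 52.49],
    "Number of Casualties": [1, 1, 1, 1, 1, 1, 2, 2, 3, 8],
    "Number of Vehicles": [1, 2, 2, 2, 2, 2, 2, 2, 3, 6],
})

count_cols = ["Number of Casualties", "Number of Vehicles"]

# Skewness and excess kurtosis side by side, for the count columns only.
shape_stats = pd.DataFrame({
    "skew": toy[count_cols].skew(),
    "excess_kurtosis": toy[count_cols].kurt(),
})

print(shape_stats)
```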

In [None]:
# Shapiro-Wilk test for normality
from scipy import stats

shapiro = stats.shapiro(df.select_dtypes(include='number'))
shapiro

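The Shapiro-Wilk test result indicates that the data is not normally distributed:

- **Statistic: 0.5952:** The statistic ranges from 0 to 1, and a value this far below 1 indicates a substantial deviation from normality.
- **p-value:** The extremely small p-value suggests strong evidence against the null hypothesis, which states that the data is normally distributed. Since the p-value is far below the common significance level of 0.05, we reject the null hypothesis and conclude that the data does not follow a normal distribution.

Two caveats apply to this result. `stats.shapiro` flattens a two-dimensional input, so the statistic describes all numerical columns pooled into one sample (Latitude and Longitude included) rather than any single column. In addition, SciPy notes that for samples larger than 5,000 the statistic is accurate but the p-value may not be, and this dataset has 31,494 rows.

This suggests that, moving forward, we will need to use non-parametric methods for statistical testing.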
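
Because `stats.shapiro` treats whatever it receives as a single sample, one option is to test each count column separately. The sketch below does so on synthetic count data, drawing a random subsample of 500 rows per column to stay well under the size at which SciPy cautions about p-value accuracy. The synthetic data, the subsample size and the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the two count columns (not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Number of Casualties": rng.poisson(0.4, size=2000) + 1,
    "Number of Vehicles": rng.poisson(0.8, size=2000) + 1,
})

results = {}
for col in toy.columns:
    # Test one column at a time on a reproducible random subsample.
    sample = toy[col].sample(n=500, random_state=0)
    res = stats.shapiro(sample)
    results[col] = (res.statistic, res.pvalue)

for col, (w, p) in results.items():
    print(f"{col}: W={w:.4f}, p={p:.3g}")
```

Testing per column keeps each result interpretable on its own, at the cost of running one test per variable.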

### Frequency Distribution


## Multivariate (Non-Graphical)