# Dataset Description


## The dataset contains information about road traffic accidents with the following columns:

`Time`: Time of the accident

`Day_of_week`: Day of the week when the accident occurred

`Age_band_of_driver`: Age group of the driver

`Sex_of_driver`: Gender of the driver

`Educational_level`: Educational level of the driver

`Vehicle_driver_relation`: Relationship of the driver to the vehicle

`Driving_experience`: Driving experience of the driver

`Type_of_vehicle`: Type of vehicle involved in the accident

`Owner_of_vehicle`: Owner of the vehicle

`Service_year_of_vehicle`: Service years of the vehicle

`Defect_of_vehicle`: Defects of the vehicle, if any

`Area_accident_occured`: Area where the accident occurred

`Lanes_or_Medians`: Lanes or medians at the accident site

`Road_allignment`: Road alignment at the accident site

`Types_of_Junction`: Type of junction at the accident site

`Road_surface_type`: Type of road surface at the accident site

`Road_surface_conditions`: Road surface conditions at the accident site

`Light_conditions`: Light conditions at the time of the accident

`Weather_conditions`: Weather conditions at the time of the accident

`Type_of_collision`: Type of collision

`Number_of_vehicles_involved`: Number of vehicles involved in the accident

`Number_of_casualties`: Number of casualties in the accident

`Vehicle_movement`: Movement of the vehicle during the accident

`Casualty_class`: Class of casualty (driver, passenger, pedestrian)

`Sex_of_casualty`: Gender of the casualty

`Age_band_of_casualty`: Age group of the casualty

`Casualty_severity`: Severity of the casualty

`Work_of_casuality`: Occupation of the casualty

`Fitness_of_casuality`: Fitness of the casualty

`Pedestrian_movement`: Movement of the pedestrian

`Cause_of_accident`: Cause of the accident

`Accident_severity`: Severity of the accident


# Tasks

## 1. Data Cleaning

### Read the dataset

In [3]:
import pandas as pd
df = pd.read_csv('/content/Task (1) Dataset.csv')

### Handle Missing Values

In [6]:
# I dropped the rows with missing values instead of filling them because each row describes an individual crash, so a sensible average per record is hard to define.
# If the missing values had belonged to a larger group, I would have filled them with each column's mode (or mean for numeric columns), e.g. df.fillna(df.mode().iloc[0])

df = df.dropna()

df.isnull().sum()

Time                           0
Day_of_week                    0
Age_band_of_driver             0
Sex_of_driver                  0
Educational_level              0
Vehicle_driver_relation        0
Driving_experience             0
Type_of_vehicle                0
Owner_of_vehicle               0
Service_year_of_vehicle        0
Defect_of_vehicle              0
Area_accident_occured          0
Lanes_or_Medians               0
Road_allignment                0
Types_of_Junction              0
Road_surface_type              0
Road_surface_conditions        0
Light_conditions               0
Weather_conditions             0
Type_of_collision              0
Number_of_vehicles_involved    0
Number_of_casualties           0
Vehicle_movement               0
Casualty_class                 0
Sex_of_casualty                0
Age_band_of_casualty           0
Casualty_severity              0
Work_of_casuality              0
Fitness_of_casuality           0
Pedestrian_movement            0
Cause_of_a
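
The mode-based alternative mentioned in the comment above could look roughly like the sketch below. The `sample` frame and its values are made up for illustration and are not rows from the real dataset.

```python
import pandas as pd

# Hypothetical sample with missing categorical values (not the real dataset).
sample = pd.DataFrame({
    "Educational_level": ["Junior high school", None, "Junior high school", "High school"],
    "Driving_experience": ["5-10yr", "Above 10yr", None, "5-10yr"],
})

# DataFrame.mode() returns one row per modal value; .iloc[0] keeps the first mode of each column.
modes = sample.mode().iloc[0]

# fillna with a Series fills each column with the value stored under that column's name.
filled = sample.fillna(modes)

print(filled.isnull().sum())
```

Filling keeps every row, whereas `dropna()` keeps only the complete ones, so the choice affects how many records the later analysis is based on.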

### Correct any inconsistent data entries.

In [50]:
df.info()
print(df['Day_of_week'].unique())
# No entries were corrected: the unique values of every column were checked in the same way as the example above, and no inconsistent entries were found.

<class 'pandas.core.frame.DataFrame'>
Index: 2889 entries, 8 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Time                         2889 non-null   datetime64[ns]
 1   Day_of_week                  2889 non-null   object        
 2   Age_band_of_driver           2889 non-null   object        
 3   Sex_of_driver                2889 non-null   object        
 4   Educational_level            2889 non-null   object        
 5   Vehicle_driver_relation      2889 non-null   object        
 6   Driving_experience           2889 non-null   object        
 7   Type_of_vehicle              2889 non-null   object        
 8   Owner_of_vehicle             2889 non-null   object        
 9   Service_year_of_vehicle      2889 non-null   object        
 10  Defect_of_vehicle            2889 non-null   object        
 11  Area_accident_occured        2889 non-null   ob
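
The per-column check described above can be run in a loop instead of one column at a time. This is a minimal sketch on a made-up `sample` frame; the inconsistent value `"monday "` is invented to show what such a check would surface, and the normalization step is only one possible fix.

```python
import pandas as pd

# Hypothetical sample containing one inconsistent entry ("monday " vs "Monday").
sample = pd.DataFrame({
    "Day_of_week": ["Monday", "monday ", "Friday", "Friday"],
    "Sex_of_driver": ["Male", "Female", "Male", "Unknown"],
})

# Print the distinct values of each text column to spot near-duplicates.
for col in ["Day_of_week", "Sex_of_driver"]:
    print(col, sorted(sample[col].unique().tolist()))

# One possible normalization: trim whitespace and capitalize the first letter.
sample["Day_of_week"] = sample["Day_of_week"].str.strip().str.capitalize()

days = sorted(sample["Day_of_week"].unique().tolist())
print(days)
```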

### Ensure data types are appropriate for each column.

---



In [37]:
df['Time'] = pd.to_datetime(df['Time'])
df['Number_of_vehicles_involved'] = df['Number_of_vehicles_involved'].astype(int)


Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
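
The warning above goes away when an explicit format is passed. The sketch below assumes the `Time` strings look like `HH:MM:SS`, which has not been verified against the file; the sample values are made up.

```python
import pandas as pd

# Hypothetical time strings in the assumed HH:MM:SS layout.
times = pd.Series(["17:02:00", "01:06:00", "23:35:00"])

# An explicit format avoids the per-element dateutil fallback.
parsed = pd.to_datetime(times, format="%H:%M:%S")

# Only the time of day is meaningful here, so keep the hour for later grouping.
hours = parsed.dt.hour
print(hours.tolist())
```

Because the strings carry no date, the date part of the parsed values is a placeholder and should not be interpreted.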



## 2. Exploratory Data Analysis (EDA)

### Perform summary statistics on the dataset.

In [9]:
df.describe()

       Number_of_vehicles_involved  Number_of_casualties
count                     2889.0                2889.0
mean                      2.011076              1.529249
std                       0.635308              0.993012
min                       1.0                   1.0
25%                       2.0                   1.0
50%                       2.0                   1.0
75%                       2.0                   2.0
max                       7.0                   8.0
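
`df.describe()` only covers the two numeric columns, while most columns in this dataset are categorical. A possible complement is the relative frequency of each category, sketched here on a made-up `sample` frame.

```python
import pandas as pd

# Hypothetical sample of one categorical column (not the real dataset).
sample = pd.DataFrame({
    "Accident_severity": ["Slight Injury", "Slight Injury", "Serious Injury", "Slight Injury"],
})

# normalize=True returns shares instead of raw counts, sorted from most to least frequent.
shares = sample["Accident_severity"].value_counts(normalize=True)
print(shares)
```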


### Identify and analyze patterns in the data.

In [55]:
# Crash counts, vehicles and casualties per vehicle type
grouped_crashes = df.groupby('Type_of_vehicle').agg(
    total_crashes=('Accident_severity', 'count'),
    total_vehicles=('Number_of_vehicles_involved', 'sum'),
    total_casualties=('Number_of_casualties', 'sum'),
    average_vehicles_involved=('Number_of_vehicles_involved', 'mean')
)

# The same aggregation per day of the week and time of the accident
time_grouped_crashes = df.groupby(['Day_of_week','Time']).agg(
    total_crashes=('Accident_severity', 'count'),
    total_vehicles=('Number_of_vehicles_involved', 'sum'),
    total_casualties=('Number_of_casualties', 'sum'),
    average_vehicles_involved=('Number_of_vehicles_involved', 'mean')
)
print(grouped_crashes)
time_grouped_crashes

                      total_crashes  total_vehicles  total_casualties  \
Type_of_vehicle                                                         
Automobile                      798            1599              1218   
Bajaj                            10              20                16   
Bicycle                           5              10                 8   
Long lorry                       80             163               126   
Lorry (11?40Q)                  137             266               209   
Lorry (41?100Q)                 579            1176               897   
Motorcycle                       68             136               116   
Other                           297             597               463   
Pick up upto 10Q                182             362               289   
Public (12 seats)               198             418               300   
Public (13?45 seats)            131             255               201   
Public (> 45 seats)             108             214

                                 total_crashes  total_vehicles  total_casualties  average_vehicles_involved
Day_of_week  Time
Friday       2024-07-25 00:45:00             1               2                 1                        2.0
Friday       2024-07-25 00:55:00             1               2                 1                        2.0
Friday       2024-07-25 01:00:00             2               3                 2                        1.5
Friday       2024-07-25 01:05:00             3               6                 6                        2.0
Friday       2024-07-25 01:46:00             2               4                 4                        2.0
...          ...                           ...             ...               ...                        ...
Wednesday    2024-07-25 22:00:00             1               2                 1                        2.0
Wednesday    2024-07-25 22:06:00             1               2                 1                        2.0
Wednesday    2024-07-25 22:10:00             2               4                 2                        2.0
Wednesday    2024-07-25 22:34:00             1               1                 1                        1.0
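
Grouping by the exact minute produces many groups with only one or two crashes each. A coarser view, sketched below on a made-up `sample` frame, groups by hour of day instead; the `hour` column is a helper introduced for this example.

```python
import pandas as pd

# Hypothetical sample; Time is parsed with an assumed HH:MM:SS format.
sample = pd.DataFrame({
    "Day_of_week": ["Friday", "Friday", "Friday", "Sunday"],
    "Time": pd.to_datetime(["01:00:00", "01:05:00", "17:30:00", "01:46:00"], format="%H:%M:%S"),
    "Number_of_vehicles_involved": [2, 1, 2, 3],
})

# Derive the hour of day, then aggregate per (day, hour) instead of per exact timestamp.
hourly = (
    sample.assign(hour=sample["Time"].dt.hour)
    .groupby(["Day_of_week", "hour"])
    .agg(
        total_crashes=("Number_of_vehicles_involved", "size"),
        total_vehicles=("Number_of_vehicles_involved", "sum"),
    )
)
print(hourly)
```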


### Visualize the distribution of key variables (e.g., Age_band_of_driver, Type_of_vehicle).

In [25]:
import plotly.express as px

fig1 = px.histogram(df, x='Age_band_of_driver', y='Number_of_vehicles_involved', color='Road_surface_conditions', title='Number of vehicles involved by driver age band and road surface condition')
fig1.show()

### Explore relationships between variables (e.g., Age_band_of_driver vs. Accident_severity).


In [39]:
df.groupby('Age_band_of_driver')['Accident_severity'].value_counts()

df.groupby(['Day_of_week','Time'])['Accident_severity'].value_counts()

Day_of_week  Time                 Accident_severity
Friday       2024-07-25 00:45:00  Slight Injury        1
             2024-07-25 00:55:00  Slight Injury        1
             2024-07-25 01:00:00  Slight Injury        2
             2024-07-25 01:05:00  Slight Injury        3
             2024-07-25 01:46:00  Slight Injury        2
                                                      ..
Wednesday    2024-07-25 22:06:00  Slight Injury        1
             2024-07-25 22:10:00  Slight Injury        2
             2024-07-25 22:34:00  Slight Injury        1
             2024-07-25 23:35:00  Serious Injury       3
                                  Slight Injury        1
Name: count, Length: 1622, dtype: int64
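
Raw counts make age bands of different sizes hard to compare. One way to put them on the same scale is a row-normalized cross-tabulation, sketched here on a made-up `sample` frame rather than the real data.

```python
import pandas as pd

# Hypothetical sample (not the real dataset).
sample = pd.DataFrame({
    "Age_band_of_driver": ["18-30", "18-30", "18-30", "18-30", "Over 51", "Over 51"],
    "Accident_severity": ["Slight Injury", "Slight Injury", "Serious Injury",
                          "Slight Injury", "Slight Injury", "Serious Injury"],
})

# normalize="index" turns each row into shares that sum to 1 within the age band.
severity_share = pd.crosstab(
    sample["Age_band_of_driver"],
    sample["Accident_severity"],
    normalize="index",
)
print(severity_share)
```

Comparing shares rather than counts shows whether a severity level is over-represented in an age band or simply reflects that the band has more records.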

## 3. Data Visualization

* Ensure the visualizations are clear and informative.

### Create visualizations to illustrate the findings from the EDA.


In [54]:
fig2 = px.line(grouped_crashes, x=grouped_crashes.index, y='total_crashes', title='Vehicle type and their total crashes', markers=True)
fig2.show()

### Use appropriate plots such as histograms, bar charts, pie charts, scatter plots, and heatmaps.

In [34]:
import plotly.graph_objects as go

fig3 = go.Figure()

fig3.add_trace(go.Scatterpolar(
    r=grouped_crashes['total_crashes'],
    theta=grouped_crashes.index,
    name='Total crashes'
))

fig3.add_trace(go.Scatterpolar(
    r=grouped_crashes['total_vehicles'],
    theta=grouped_crashes.index,
    name='Total vehicles'
))

fig3.add_trace(go.Scatterpolar(
    r=grouped_crashes['total_casualties'],
    theta=grouped_crashes.index,
    name='Total casualties'
))

fig3.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=False,
        )
    ),
    showlegend=True,
    title='Total number of crashes, vehicles, and casualties by vehicle type'
)

fig3.show()

## 4. Insights and Conclusions

* <h3>Summarize the key insights gained from the data analysis.</h3>
* <h3>Draw conclusions based on the patterns observed in the data.</h3>

In [None]:
# Key insights
# 1. The 18-30 age band has the most accidents, while the under-18 band has the fewest.
# 2. Slight injury is the most common accident severity.
# 3. On average, about 2 vehicles are involved per accident; Sunday had the highest average (2.05) and Thursday the lowest (1.93).
# 4. Friday had the most crashes (476), while Sunday had the fewest (333).
# 5. A total of 2889 accident records remain in the dataset after rows with missing values were dropped.
# 6. There were more accidents with an unknown driver age band than accidents with drivers under 18.
# 7. Accidents with the road surface condition 'Flooded with water over 3cm' were rare across all age bands.
# 8. Accidents on dry road surfaces were more common than accidents on damp or wet surfaces.
# 9. Automobiles are the vehicle type most often involved in accidents.

# Conclusions
# 1. Friday and Saturday had high crash counts, presumably because more people than usual go out at the weekend.
# 2. Although the 18-30 age band has a high accident count, most of its accidents resulted in slight injuries, possibly because younger people are physically more resilient.
# 3. Drivers aged 51 or over appear to drive more carefully, whereas drivers aged 18-30 appear to drive more carelessly and have a higher casualty level.
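
The per-day figures quoted in the insights can be reproduced with a single aggregation. The sketch below uses a made-up `sample` frame, so its numbers are illustrative and do not match the values reported above.

```python
import pandas as pd

# Hypothetical sample (not the real dataset).
sample = pd.DataFrame({
    "Day_of_week": ["Friday", "Friday", "Friday", "Sunday", "Sunday"],
    "Number_of_vehicles_involved": [2, 2, 1, 3, 2],
})

# Crash count and average number of vehicles involved for each day of the week.
per_day = sample.groupby("Day_of_week")["Number_of_vehicles_involved"].agg(
    crashes="size",
    mean_vehicles="mean",
)

# Day with the most crashes and day with the highest average number of vehicles.
busiest_day = per_day["crashes"].idxmax()
most_vehicles_day = per_day["mean_vehicles"].idxmax()
print(per_day)
print(busiest_day, most_vehicles_day)
```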