<a href="https://colab.research.google.com/github/LegendSeyi/Thai_Road_Accident_Data_Analysis/blob/main/Thai_Road_Accident_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **THAI ROAD ACCIDENT ANALYSIS**
#### **_By Adeyemi Oluwaseyi Emmanuel_**

[Click here to check Linkedin profile](https://www.linkedin.com/in/oluwaseyi-adeyemi-33b1ab197/)

[Twitter link](https://twitter.com/AmLegendseyi)

## **INTRODUCTION**

This dataset provides statistics on recorded road accidents in Thailand from approximately 2019 to 2022. The data was sourced from raw information provided by the Office of the Permanent Secretary, Ministry of Transport, which is also used in a public dashboard for easier access and visualization. The dataset covers various aspects of road accidents and aims to shed light on trends and patterns in this critical area of concern. Analysis of this data could help guide road safety policies and measures.

We import our required libraries

In [361]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [362]:
## creating a dictionary in python and converting it to a dataframe
tb = {'Column':['acc_code',
    'incident_datetime',
    'report_datetime',
    'province_th',
    'province_en',
    'agency',
    'route',
    'vehicle_type',
    'presumed_cause',
    'accident_type',
    'number_of_vehicles_involved',
    'number_of_fatalities',
    'number_of_injuries',
    'weather_condition',
    'latitude',
    'longitude',
    'road_description',
    'slope_description'],
    'Description':['The accident code or identifier',
    'The date and time of the accident occurrence',
    'The date and time when the accident was reported',
    'The name of the province in Thailand, written in Thai',
    'The name of the province in Thailand, written in English',
    'The government agency responsible for the road and traffic management',
    'The route or road segment where the accident occurred',
    'The type of vehicle involved in the accident',
    'The presumed cause or reason for the accident',
    'The type or nature of the accident',
    'The number of vehicles involved in the accident',
    'The number of fatalities resulting from the accident',
    'The number of injuries resulting from the accident',
    'The weather condition at the time of the accident',
    'The latitude coordinate of the accident location',
    'The longitude coordinate of the accident location',
    'The description of the road type or configuration where the accident occurred',
    'The description of the slope condition at the accident location']
     }

In [363]:
## Converting the dictionary to a data frame
dico = pd.DataFrame(tb)

In [364]:
dico

Unnamed: 0,Column,Description
0,acc_code,The accident code or identifier
1,incident_datetime,The date and time of the accident occurrence
2,report_datetime,The date and time when the accident was reported
3,province_th,"The name of the province in Thailand, written ..."
4,province_en,"The name of the province in Thailand, written ..."
5,agency,The government agency responsible for the road...
6,route,The route or road segment where the accident o...
7,vehicle_type,The type of vehicle involved in the accident
8,presumed_cause,The presumed cause or reason for the accident
9,accident_type,The type or nature of the accident


The table above describes each column in the dataset.
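The preview cuts the longer descriptions off with `...`. As a minimal sketch, assuming a small stand-in for `dico` (the name `dico_sample` is hypothetical and its two rows are copied from the dictionary above), the pandas `display.max_colwidth` option can be lifted temporarily so the full text is shown:

```python
import pandas as pd

# Hypothetical stand-in for `dico`, using two rows from the dictionary above
dico_sample = pd.DataFrame({
    'Column': ['province_th', 'province_en'],
    'Description': [
        'The name of the province in Thailand, written in Thai',
        'The name of the province in Thailand, written in English',
    ],
})

# Lift the column-width limit only inside this block so long text is not cut off
with pd.option_context('display.max_colwidth', None):
    full_text = repr(dico_sample)

print(full_text)
```

Using `option_context` keeps the change local, so other previews in the notebook keep their default width.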

With this information, we can now import our dataset for the analysis.

## **UNDERSTANDING THE DATASET**

In [365]:
# We will use the pandas library to import our dataset

In [366]:
# We can import our dataset from github


url = 'https://raw.githubusercontent.com/LegendSeyi/Dataset/main/thai_road_accident_2019_2022.csv'

pd.set_option('display.max_columns', None)
data = pd.read_csv(url, encoding= 'utf-8')

In [367]:
data.head()

Unnamed: 0,acc_code,incident_datetime,report_datetime,province_th,province_en,agency,route,vehicle_type,presumed_cause,accident_type,number_of_vehicles_involved,number_of_fatalities,number_of_injuries,weather_condition,latitude,longitude,road_description,slope_description
0,571905,2019-01-01 00:00:00,2019-01-02 06:11:00,ลพบุรี,Loburi,department of rural roads,แยกทางหลวงหมายเลข 21 (กม.ที่ 31+000) - บ้านวัง...,motorcycle,driving under the influence of alcohol,other,1,0,2,clear,14.959105,100.873463,straight road,no slope
1,3790870,2019-01-01 00:03:00,2020-02-20 13:48:00,อุบลราชธานี,Ubon Ratchathani,department of highways,เดชอุดม - อุบลราชธานี,private/passenger car,speeding,rollover/fallen on straight road,1,0,2,clear,15.210738,104.862689,straight road,no slope
2,599075,2019-01-01 00:05:00,2019-01-01 10:35:00,ประจวบคีรีขันธ์,Prachuap Khiri Khan,department of highways,ปราณบุรี - ปากน้ำปราณ,motorcycle,speeding,head-on collision (not overtaking),2,1,0,clear,12.374259,99.907949,wide curve,slope area
3,571924,2019-01-01 00:20:00,2019-01-02 05:12:00,เชียงใหม่,Chiang Mai,department of rural roads,เชื่อมทางหลวงหมายเลข 1013 (กม.ที่ 8+200) - บ้า...,motorcycle,driving under the influence of alcohol,other,1,0,1,clear,18.601721,98.804204,straight road,no slope
4,599523,2019-01-01 00:25:00,2019-01-04 09:42:00,นครสวรรค์,Nakhon Sawan,department of highways,เกยไชย - ศรีมงคล,private/passenger car,cutting in closely by people/vehicles/animals,rollover/fallen on straight road,1,0,0,clear,15.866389,100.59001,straight road,no slope


We have imported our data, so we can check the size of the dataset

In [368]:
data.shape

(81735, 18)

We have 81,735 rows and 18 columns in the dataset.

In [369]:
data.columns

Index(['acc_code', 'incident_datetime', 'report_datetime', 'province_th',
       'province_en', 'agency', 'route', 'vehicle_type', 'presumed_cause',
       'accident_type', 'number_of_vehicles_involved', 'number_of_fatalities',
       'number_of_injuries', 'weather_condition', 'latitude', 'longitude',
       'road_description', 'slope_description'],
      dtype='object')

We generated the list of available columns. In the output above the first column is already named `acc_code`, so no renaming step is needed here.

Generate the information of the dataset

In [370]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81735 entries, 0 to 81734
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   acc_code                     81735 non-null  int64  
 1   incident_datetime            81735 non-null  object 
 2   report_datetime              81735 non-null  object 
 3   province_th                  81735 non-null  object 
 4   province_en                  81735 non-null  object 
 5   agency                       81735 non-null  object 
 6   route                        81735 non-null  object 
 7   vehicle_type                 81735 non-null  object 
 8   presumed_cause               81735 non-null  object 
 9   accident_type                81735 non-null  object 
 10  number_of_vehicles_involved  81735 non-null  int64  
 11  number_of_fatalities         81735 non-null  int64  
 12  number_of_injuries           81735 non-null  int64  
 13  weather_conditio

We can observe that the dataset contains several data types, such as float, int, and object.

In [371]:
# code to generate the number of null values present in the columns
data.isnull().sum()

acc_code                         0
incident_datetime                0
report_datetime                  0
province_th                      0
province_en                      0
agency                           0
route                            0
vehicle_type                     0
presumed_cause                   0
accident_type                    0
number_of_vehicles_involved      0
number_of_fatalities             0
number_of_injuries               0
weather_condition                0
latitude                       359
longitude                      359
road_description                 0
slope_description                0
dtype: int64

We can observe that the latitude and longitude columns contain the same number of missing values (359 each).
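Equal counts alone do not show that the gaps fall on the same rows. A minimal sketch of how that could be checked, using a hypothetical four-row sample rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two rows are missing both coordinates
sample = pd.DataFrame({
    'latitude': [14.959105, np.nan, 12.374259, np.nan],
    'longitude': [100.873463, np.nan, 99.907949, np.nan],
})

lat_missing = sample['latitude'].isnull()
lon_missing = sample['longitude'].isnull()

# True only if every row is missing either both coordinates or neither
same_rows = bool((lat_missing == lon_missing).all())

# Number of rows where both coordinates are missing
both_missing = int((lat_missing & lon_missing).sum())

print(same_rows, both_missing)
```

On the real data the same two expressions could be run against `data` in place of `sample`.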

## **DATA CLEANING**

### **Incident_datetime column**

In [372]:
# We start by analysing the 'incident_datetime' column.
# The datetime conversion is left commented out so the column stays as strings,
# which the string slicing used later in this section relies on.
#data['incident_datetime'] = pd.to_datetime(data['incident_datetime'])

In [373]:
data['incident_datetime'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 81735 entries, 0 to 81734
Series name: incident_datetime
Non-Null Count  Dtype 
--------------  ----- 
81735 non-null  object
dtypes: object(1)
memory usage: 638.7+ KB


The column is still stored as object (string) type, as the output above shows, because the conversion line is commented out. Let's check whether it contains any null values.

In [374]:
data['incident_datetime'].isnull().sum()

0

We can confirm that there are no null values.

We can sort the dataset by the incident datetime column.

In [375]:
data.sort_values(by='incident_datetime', inplace=True)

Now we can be sure that the dataset has been sorted

**Now we want to create two columns from the datetime column: one for the date (Year-Month-Day) and another for the time (Hour-Minute-Second).**

In [376]:
# Creating a new column with just the date (the first 10 characters, 'YYYY-MM-DD')
data['incident_date'] = data['incident_datetime'].str[:10]
# Creating a new column with just the time (the characters after the separating space)
data['incident_time'] = data['incident_datetime'].str[11:]
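String slicing works only while every value follows the exact `YYYY-MM-DD HH:MM:SS` layout. As an alternative sketch (not the approach used in this notebook), the strings could be parsed with `pd.to_datetime` and formatted back with `.dt.strftime`. The two-row `sample` frame below is hypothetical:

```python
import pandas as pd

# Hypothetical sample in the same 'YYYY-MM-DD HH:MM:SS' layout as the dataset
sample = pd.DataFrame({
    'incident_datetime': ['2019-01-01 00:03:00', '2019-01-01 00:25:00'],
})

# Parse the strings into real datetime values; a malformed value raises an error here
parsed = pd.to_datetime(sample['incident_datetime'], format='%Y-%m-%d %H:%M:%S')

# Format the date part and the time part back into separate string columns
sample['incident_date'] = parsed.dt.strftime('%Y-%m-%d')
sample['incident_time'] = parsed.dt.strftime('%H:%M:%S')

print(sample)
```

Parsing first means a badly formatted row fails loudly instead of producing a silently wrong date or time.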


Now we rearrange the data to bring the newly created columns to the front. The original incident_datetime column is dropped in a later cell.

In [377]:

# We use the pop function to remove the new columns from the end of the dataframe
first_col = data.pop('incident_date')
second_col = data.pop('incident_time')

# Then we insert each popped column by giving the position, the new column name, and the column itself
data.insert(1, 'Incident_date', first_col)
data.insert(2, 'Incident_time', second_col)


### **Report_datetime column**

Now we want to create two columns from the report datetime column: one for the date (Year-Month-Day) and another for the time (Hour-Minute-Second).

In [378]:
# Creating a new column with just the date (the first 10 characters, 'YYYY-MM-DD')
data['report_date'] = data['report_datetime'].str[:10]
# Creating a new column with just the time (the characters after the separating space)
data['report_time'] = data['report_datetime'].str[11:]

Now we rearrange the data to bring the newly created report columns to the front. The original datetime columns are dropped afterwards.

I use Ctrl + Shift + L to select every occurrence of a repeated string after highlighting one of them.

In [379]:
# we use the pop function
first_col = data.pop('report_date')
second_col = data.pop('report_time')

# then we insert each popped column by giving the position, the column name, and the column itself
data.insert(3, 'report_date', first_col)
data.insert(4, 'report_time', second_col)



In [380]:
# we can use the drop function to delete the original report_datetime and incident_datetime columns
data.drop(['report_datetime','incident_datetime'], axis=1, inplace=True)
data.head()


Unnamed: 0,acc_code,Incident_date,Incident_time,report_date,report_time,province_th,province_en,agency,route,vehicle_type,presumed_cause,accident_type,number_of_vehicles_involved,number_of_fatalities,number_of_injuries,weather_condition,latitude,longitude,road_description,slope_description
0,571905,2019-01-01,00:00:00,2019-01-02,06:11:00,ลพบุรี,Loburi,department of rural roads,แยกทางหลวงหมายเลข 21 (กม.ที่ 31+000) - บ้านวัง...,motorcycle,driving under the influence of alcohol,other,1,0,2,clear,14.959105,100.873463,straight road,no slope
1,3790870,2019-01-01,00:03:00,2020-02-20,13:48:00,อุบลราชธานี,Ubon Ratchathani,department of highways,เดชอุดม - อุบลราชธานี,private/passenger car,speeding,rollover/fallen on straight road,1,0,2,clear,15.210738,104.862689,straight road,no slope
2,599075,2019-01-01,00:05:00,2019-01-01,10:35:00,ประจวบคีรีขันธ์,Prachuap Khiri Khan,department of highways,ปราณบุรี - ปากน้ำปราณ,motorcycle,speeding,head-on collision (not overtaking),2,1,0,clear,12.374259,99.907949,wide curve,slope area
3,571924,2019-01-01,00:20:00,2019-01-02,05:12:00,เชียงใหม่,Chiang Mai,department of rural roads,เชื่อมทางหลวงหมายเลข 1013 (กม.ที่ 8+200) - บ้า...,motorcycle,driving under the influence of alcohol,other,1,0,1,clear,18.601721,98.804204,straight road,no slope
4,599523,2019-01-01,00:25:00,2019-01-04,09:42:00,นครสวรรค์,Nakhon Sawan,department of highways,เกยไชย - ศรีมงคล,private/passenger car,cutting in closely by people/vehicles/animals,rollover/fallen on straight road,1,0,0,clear,15.866389,100.59001,straight road,no slope


**Looking at the datetime columns, we noticed that in some rows the reported date/time differs from when the incident occurred. Analysing this gap could give some insight into why the reporting delay occurred.**
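A minimal sketch of how the reporting delay could be measured, assuming the four date and time columns created above. The two-row `sample` frame reuses values from the preview, and the column name `report_delay_hours` is hypothetical:

```python
import pandas as pd

# Hypothetical sample reusing two rows from the preview above
sample = pd.DataFrame({
    'Incident_date': ['2019-01-01', '2019-01-01'],
    'Incident_time': ['00:00:00', '00:05:00'],
    'report_date': ['2019-01-02', '2019-01-01'],
    'report_time': ['06:11:00', '10:35:00'],
})

# Rebuild full timestamps; strip() guards against stray spaces in the time strings
incident = pd.to_datetime(sample['Incident_date'] + ' ' + sample['Incident_time'].str.strip())
report = pd.to_datetime(sample['report_date'] + ' ' + sample['report_time'].str.strip())

# Delay between the incident and its report, expressed in hours
sample['report_delay_hours'] = (report - incident).dt.total_seconds() / 3600

print(sample[['Incident_date', 'report_date', 'report_delay_hours']])
```

A column like this could then be summarised or grouped by agency or province to see where long delays concentrate.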

### **Province_th Column**

In [381]:
# preview the column
data['province_th']

0                 ลพบุรี
1            อุบลราชธานี
2        ประจวบคีรีขันธ์
3              เชียงใหม่
4              นครสวรรค์
              ...       
81730            นนทบุรี
81731               ตราด
81732          มหาสารคาม
81733          กำแพงเพชร
81734           เชียงราย
Name: province_th, Length: 81735, dtype: object

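From the description, this column holds the names of the provinces where the incidents occurred, written in Thai.

I decided to drop this column because the Thai text was not well represented, and the province_en column carries the same information in English.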

In [382]:
data.drop('province_th', axis=1, inplace=True)

### **Province_en Column**

In [383]:
#preview the column
data['province_en']

0                     Loburi
1           Ubon Ratchathani
2        Prachuap Khiri Khan
3                 Chiang Mai
4               Nakhon Sawan
                ...         
81730             Nonthaburi
81731                   Trat
81732          Maha Sarakham
81733         Kamphaeng Phet
81734             Chiang Rai
Name: province_en, Length: 81735, dtype: object

Let's check the provinces involved.

In [384]:
# List of the unique provinces in the dataset
data['province_en'].unique()

array(['Loburi', 'Ubon Ratchathani', 'Prachuap Khiri Khan', 'Chiang Mai',
       'Nakhon Sawan', 'Mae Hong Son', 'Chumphon', 'Sing Buri',
       'Songkhla', 'Lamphun', 'Trat', 'Phuket', 'Saraburi', 'Ratchaburi',
       'Phra Nakhon Si Ayutthaya', 'Nakhon Ratchasima',
       'Nakhon Si Thammarat', 'Chaiyaphum', 'Kalasin', 'Suphan Buri',
       'Phetchaburi', 'Phrae', 'Chai Nat', 'Prachin Buri',
       'Nakhon Pathom', 'Kanchanaburi', 'Phetchabun', 'Ang Thong',
       'Nonthaburi', 'Samut Prakan', 'Bangkok', 'Phayao', 'Phatthalung',
       'Yala', 'Maha Sarakham', 'Surat Thani', 'Amnat Charoen',
       'Nong Khai', 'Nan', 'Phangnga', 'Narathiwat', 'Samut Sakhon',
       'Chanthaburi', 'Samut Songkhram', 'Phitsanulok', 'Pathum Thani',
       'Tak', 'Loei', 'Chiang Rai', 'Chachoengsao', 'Buri Ram',
       'Uthai Thani', 'Krabi', 'Surin', 'Udon Thani', 'Si Sa Ket',
       'Uttaradit', 'Khon Kaen', 'Kamphaeng Phet', 'Yasothon', 'Satun',
       'Nakhon Nayok', 'Chon Buri', 'Rayong', 'buogkan'

Let's convert this column to a categorical column.

In [385]:
data['province_en'] = data['province_en'].astype('category')

In [386]:
data['province_en'].info()

<class 'pandas.core.series.Series'>
Int64Index: 81735 entries, 0 to 81734
Series name: province_en
Non-Null Count  Dtype   
--------------  -----   
81735 non-null  category
dtypes: category(1)
memory usage: 721.0 KB
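The figures printed by `info()` are hard to compare directly: for object columns it reports a shallow size (hence the `+`), and the index is included in the total. A rough sketch of a fairer comparison, using `memory_usage(index=False, deep=True)` on a hypothetical sample of repeated province names:

```python
import pandas as pd

# Hypothetical sample: three province names repeated many times
provinces = pd.Series(['Bangkok', 'Chiang Mai', 'Phuket'] * 1000, name='province_en')

# deep=True counts the actual string contents; index=False leaves the index out
as_text = provinces.memory_usage(index=False, deep=True)
as_category = provinces.astype('category').memory_usage(index=False, deep=True)

print(f'text: {as_text} bytes, category: {as_category} bytes')
```

The saving comes from storing each distinct name once and keeping only small integer codes per row, so it grows with how often values repeat.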


Renaming province_en to province, since we dropped the Thai version of the column.

In [387]:
# Rename only the province_en column so no other column name is touched by accident
data = data.rename(columns={'province_en': 'province'})

In [388]:
data.head(3)

Unnamed: 0,acc_code,Incident_date,Incident_time,report_date,report_time,province,agency,route,vehicle_type,presumed_cause,accident_type,number_of_vehicles_involved,number_of_fatalities,number_of_injuries,weather_condition,latitude,longitude,road_description,slope_description
0,571905,2019-01-01,00:00:00,2019-01-02,06:11:00,Loburi,department of rural roads,แยกทางหลวงหมายเลข 21 (กม.ที่ 31+000) - บ้านวัง...,motorcycle,driving under the influence of alcohol,other,1,0,2,clear,14.959105,100.873463,straight road,no slope
1,3790870,2019-01-01,00:03:00,2020-02-20,13:48:00,Ubon Ratchathani,department of highways,เดชอุดม - อุบลราชธานี,private/passenger car,speeding,rollover/fallen on straight road,1,0,2,clear,15.210738,104.862689,straight road,no slope
2,599075,2019-01-01,00:05:00,2019-01-01,10:35:00,Prachuap Khiri Khan,department of highways,ปราณบุรี - ปากน้ำปราณ,motorcycle,speeding,head-on collision (not overtaking),2,1,0,clear,12.374259,99.907949,wide curve,slope area


### **Agency Column**

In [389]:
## Getting the categories of the column
data['agency'].unique()

array(['department of rural roads', 'department of highways',
       'expressway authority of thailand'], dtype=object)

Three government agencies are involved:

*   department of rural roads  
*   department of highways
*   expressway authority of thailand




### **Route Column**

In [390]:
data['route'].head()

0    แยกทางหลวงหมายเลข 21 (กม.ที่ 31+000) - บ้านวัง...
1                                เดชอุดม - อุบลราชธานี
2                                ปราณบุรี - ปากน้ำปราณ
3    เชื่อมทางหลวงหมายเลข 1013 (กม.ที่ 8+200) - บ้า...
4                                     เกยไชย - ศรีมงคล
Name: route, dtype: object

From the list above, we noticed that the routes were recorded in Thai, so we will translate them from Thai to English.

In [391]:
# getting the number of unique routes in the column
data['route'].nunique()

3882

There are 3882 unique routes in the column. We will translate these unique routes to English and use the result as a dictionary to translate the entire route column.

In [392]:
# importing the csv file that has the translation of the route unique value
route_dic = pd.read_csv('https://raw.githubusercontent.com/LegendSeyi/Dataset/main/tranlated_thai_routes.csv', encoding='utf-8')
route_dic

Unnamed: 0,Thai,English
0,แยกทางหลวงหมายเลข 21 (กม.ที่ 31+000) - บ้านวัง...,Highway 21 (km 31+000) - Ban Wang Phloeng
1,เดชอุดม - อุบลราชธานี,Dej Udom - Ubon Ratchathani
2,ปราณบุรี - ปากน้ำปราณ,Pranburi - Pak Nam Pran
3,เชื่อมทางหลวงหมายเลข 1013 (กม.ที่ 8+200) - บ้า...,Connect Highway 1013 (km 8+200) - Ban Khun Klang
4,เกยไชย - ศรีมงคล,Kayachai -Sri Mongkhon
...,...,...
3877,แยกทางหลวงหมายเลข 3204 (กม.ที่ 16+940) - บ้านขลู่,Highway intersection number 3204 (km 16+940) -...
3878,แยกทางหลวงหมายเลข 3 (กม.ที่ 452+700) - บ้านตาหนึก,Highway 3 (Km. 452+700) - Ban Ta Nu
3879,แยกทางหลวง 3329 (กม.ที่ 16+720) -บ้านนิคมเขาบ่...,Highway 3329 (Km 16+720) -Ban Nikhom Khao Bo Kaeo
3880,แยกทางหลวงหมายเลข 347 (กม.ที่ 21+700) - บ้านรุน,Highway intersection number 347 (Km. 21+700) -...


In [393]:

# Create a mapping dictionary from the CSV file
mapping_dict = dict(zip(route_dic['Thai'], route_dic['English']))

data['route'] = data['route'].map(mapping_dict)
data.head()

Unnamed: 0,acc_code,Incident_date,Incident_time,report_date,report_time,province,agency,route,vehicle_type,presumed_cause,accident_type,number_of_vehicles_involved,number_of_fatalities,number_of_injuries,weather_condition,latitude,longitude,road_description,slope_description
0,571905,2019-01-01,00:00:00,2019-01-02,06:11:00,Loburi,department of rural roads,Highway 21 (km 31+000) - Ban Wang Phloeng,motorcycle,driving under the influence of alcohol,other,1,0,2,clear,14.959105,100.873463,straight road,no slope
1,3790870,2019-01-01,00:03:00,2020-02-20,13:48:00,Ubon Ratchathani,department of highways,Dej Udom - Ubon Ratchathani,private/passenger car,speeding,rollover/fallen on straight road,1,0,2,clear,15.210738,104.862689,straight road,no slope
2,599075,2019-01-01,00:05:00,2019-01-01,10:35:00,Prachuap Khiri Khan,department of highways,Pranburi - Pak Nam Pran,motorcycle,speeding,head-on collision (not overtaking),2,1,0,clear,12.374259,99.907949,wide curve,slope area
3,571924,2019-01-01,00:20:00,2019-01-02,05:12:00,Chiang Mai,department of rural roads,Connect Highway 1013 (km 8+200) - Ban Khun Klang,motorcycle,driving under the influence of alcohol,other,1,0,1,clear,18.601721,98.804204,straight road,no slope
4,599523,2019-01-01,00:25:00,2019-01-04,09:42:00,Nakhon Sawan,department of highways,Kayachai -Sri Mongkhon,private/passenger car,cutting in closely by people/vehicles/animals,rollover/fallen on straight road,1,0,0,clear,15.866389,100.59001,straight road,no slope


In [394]:
data['route'].nunique()

3872
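The unique count drops from 3882 to 3872 after mapping. One possible explanation is that several Thai routes share an English translation. Another is that some routes had no match in the dictionary, since `Series.map` turns unmatched values into NaN and `nunique()` ignores NaN by default. A minimal sketch of how unmatched values could be spotted and kept, using hypothetical route labels:

```python
import pandas as pd

# Hypothetical routes and a mapping that deliberately misses 'C'
routes = pd.Series(['A', 'B', 'C', 'A'], name='route')
mapping_dict = {'A': 'Route A', 'B': 'Route B'}

# Values without a dictionary entry become NaN
mapped = routes.map(mapping_dict)

# Original values that found no translation
unmapped = routes[mapped.isnull()].drop_duplicates().tolist()

# Keep the original text wherever no translation exists
mapped_with_fallback = mapped.fillna(routes)

print(unmapped)
print(mapped_with_fallback.tolist())
```

Checking `unmapped` before overwriting the column would show whether any routes are being lost in translation.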

In [395]:
#Checking if the change is applied
data.head(3)

Unnamed: 0,acc_code,Incident_date,Incident_time,report_date,report_time,province,agency,route,vehicle_type,presumed_cause,accident_type,number_of_vehicles_involved,number_of_fatalities,number_of_injuries,weather_condition,latitude,longitude,road_description,slope_description
0,571905,2019-01-01,00:00:00,2019-01-02,06:11:00,Loburi,department of rural roads,Highway 21 (km 31+000) - Ban Wang Phloeng,motorcycle,driving under the influence of alcohol,other,1,0,2,clear,14.959105,100.873463,straight road,no slope
1,3790870,2019-01-01,00:03:00,2020-02-20,13:48:00,Ubon Ratchathani,department of highways,Dej Udom - Ubon Ratchathani,private/passenger car,speeding,rollover/fallen on straight road,1,0,2,clear,15.210738,104.862689,straight road,no slope
2,599075,2019-01-01,00:05:00,2019-01-01,10:35:00,Prachuap Khiri Khan,department of highways,Pranburi - Pak Nam Pran,motorcycle,speeding,head-on collision (not overtaking),2,1,0,clear,12.374259,99.907949,wide curve,slope area


### **Vehicle Type Column**

In [396]:
data['vehicle_type'].head()

0               motorcycle
1    private/passenger car
2               motorcycle
3               motorcycle
4    private/passenger car
Name: vehicle_type, dtype: object

In [397]:
## checking if there is no null value
data['vehicle_type'].isnull().sum()

0

we have no null values

In [398]:
# Checking the number of vehicle categories involved in the accidents
data['vehicle_type'].nunique()

15

15 different vehicle types were involved.
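`nunique()` gives only the number of categories. As a small sketch, `value_counts()` would also show how often each type appears. The sample below is hypothetical and uses only the two vehicle types visible in the preview:

```python
import pandas as pd

# Hypothetical sample built from the two vehicle types seen in the preview
vehicles = pd.Series(
    ['motorcycle', 'private/passenger car', 'motorcycle',
     'motorcycle', 'private/passenger car'],
    name='vehicle_type',
)

# Count the rows per vehicle type, most frequent first
counts = vehicles.value_counts()

print(counts)
```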

### **Presumed Cause Column**

In [399]:
data['presumed_cause'].head()

0           driving under the influence of alcohol
1                                         speeding
2                                         speeding
3           driving under the influence of alcohol
4    cutting in closely by people/vehicles/animals
Name: presumed_cause, dtype: object

In [400]:
#checking if we have null values
data['presumed_cause'].isnull().sum()

0

**Checking the unique presumed causes of accidents**

In [401]:
data['presumed_cause'].unique()

array(['driving under the influence of alcohol', 'speeding',
       'cutting in closely by people/vehicles/animals',
       'failure to yield right of way', 'failure to yield/signal',
       'falling asleep', 'running red lights/traffic signals', 'other',
       'unfamiliarity with the route/unskilled driving',
       'vehicle equipment failure', 'illegal overtaking', 'tailgating',
       'ignoring stop sign while leaving intersection',
       'overloaded vehicle', 'insufficient light',
       'disabled vehicle without proper signals', 'abrupt lane change',
       'debris/obstruction on the road', 'reversing vehicle',
       'brake/anti-lock brake system failure', 'medical condition',
       'vehicle electrical system failure', 'driving in the wrong lane',
       'straddling lanes', 'dangerous curve',
       'failure to signal enter/exit parking', 'slippery road',
       'no traffic signs', 'sudden stop',
       'using mobile phone while driving',
       'driving without headlights/ill

In [402]:
data['presumed_cause'].nunique()

54

**We have 54 unique presumed causes of accidents**

**From the full list, we can observe that some causes are recorded in Thai rather than English, so it is best to translate them to English.**

**There are 3 such unique presumed causes.**
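One way such values could be found without reading the whole list is to flag labels containing non-ASCII characters. This is only a heuristic sketch: it assumes the English labels are plain ASCII, and the `causes` sample is hypothetical apart from the two Thai strings, which come from this notebook:

```python
import pandas as pd

# Hypothetical sample mixing English causes with two Thai causes from the dataset
causes = pd.Series([
    'speeding',
    'ป้ายจราจรชำรุด',
    'tailgating',
    'มึนเมาจากแอลกอฮอล์',
    'speeding',
])

# Keep each distinct label that contains at least one non-ASCII character
non_english = [cause for cause in causes.unique() if not cause.isascii()]

print(non_english)
```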

In [403]:
# Using the googletrans library to translate the Thai text to english
!pip install googletrans==4.0.0-rc1



In [404]:
#importing the translator from the googletrans library.
from googletrans import Translator

# translate the Thai texts to English
thai_values = ['ป้ายจราจรชำรุด', 'เส้นแบ่งทิศทางจราจรชำรุด', 'มึนเมาจากแอลกอฮอล์']

# Initialize the translator
translator = Translator()

# Translate Thai values to English and store in a dictionary
translated_dict = {}
for thai_value in thai_values:
    translation = translator.translate(thai_value, src='th', dest='en')
    translated_dict[thai_value] = translation.text

# Print the translated dictionary
for thai_value, english_translation in translated_dict.items():
    print(f"'{thai_value}': '{english_translation}'")

'ป้ายจราจรชำรุด': 'Damaged traffic signs'
'เส้นแบ่งทิศทางจราจรชำรุด': 'Damaged traffic direction'
'มึนเมาจากแอลกอฮอล์': 'Dizzy from alcohol'


In [405]:
# This is the dictionary containing the Thai and English text translation
translated_dict

{'ป้ายจราจรชำรุด': 'Damaged traffic signs',
 'เส้นแบ่งทิศทางจราจรชำรุด': 'Damaged traffic direction',
 'มึนเมาจากแอลกอฮอล์': 'Dizzy from alcohol'}

**Defining a function that will replace the Thai text with English text**

In [406]:
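# Thai causes and their English translations (the same pairs as translated_dict above)
thai_to_english = {
    'ป้ายจราจรชำรุด': 'Damaged traffic signs',
    'เส้นแบ่งทิศทางจราจรชำรุด': 'Damaged traffic direction',
    'มึนเมาจากแอลกอฮอล์': 'Dizzy from alcohol',
}

# define function that will replace the Thai texts with English
def change_lang(x):
    # Return the English translation for a Thai value; leave any other value unchanged
    return thai_to_english.get(x, x)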

**Applying the function to the presumed cause column**

In [407]:
data['presumed_cause'] = data['presumed_cause'].apply(change_lang)
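As an alternative sketch to a custom function, `Series.replace` accepts a dictionary and swaps matching values directly. The `causes` sample and the name `thai_to_english` are hypothetical; the translation pairs are the ones printed above:

```python
import pandas as pd

# Translation pairs copied from the translated dictionary shown above
thai_to_english = {
    'ป้ายจราจรชำรุด': 'Damaged traffic signs',
    'เส้นแบ่งทิศทางจราจรชำรุด': 'Damaged traffic direction',
    'มึนเมาจากแอลกอฮอล์': 'Dizzy from alcohol',
}

# Hypothetical sample with one English and two Thai causes
causes = pd.Series(['speeding', 'ป้ายจราจรชำรุด', 'มึนเมาจากแอลกอฮอล์'])

# Values found in the dictionary are replaced; everything else is left as it is
translated = causes.replace(thai_to_english)

print(translated.tolist())
```

Both approaches give the same result here; the dictionary form simply keeps the translations in one place.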

In [408]:
# Checking if the change is applied
data['presumed_cause'].unique()

array(['driving under the influence of alcohol', 'speeding',
       'cutting in closely by people/vehicles/animals',
       'failure to yield right of way', 'failure to yield/signal',
       'falling asleep', 'running red lights/traffic signals', 'other',
       'unfamiliarity with the route/unskilled driving',
       'vehicle equipment failure', 'illegal overtaking', 'tailgating',
       'ignoring stop sign while leaving intersection',
       'overloaded vehicle', 'insufficient light',
       'disabled vehicle without proper signals', 'abrupt lane change',
       'debris/obstruction on the road', 'reversing vehicle',
       'brake/anti-lock brake system failure', 'medical condition',
       'vehicle electrical system failure', 'driving in the wrong lane',
       'straddling lanes', 'dangerous curve',
       'failure to signal enter/exit parking', 'slippery road',
       'no traffic signs', 'sudden stop',
       'using mobile phone while driving',
       'driving without headlights/ill

**Done**

### **Accident Type Column**

**Checking the categories or type of accident**

In [409]:
data.accident_type.unique()

array(['other', 'rollover/fallen on straight road',
       'head-on collision (not overtaking)',
       'collision at intersection corner',
       'collision with obstruction (on road surface)',
       'rear-end collision', 'pedestrian collision',
       'rollover/fallen on curved road', 'collision during overtaking',
       'turning/retreating collision', 'side collision'], dtype=object)

In [410]:
# Checking if there are null values
data.accident_type.isnull().sum()

0

**There are no null values**

### **Number of vehicles involved Column**

In [411]:
data.number_of_vehicles_involved.unique()

array([ 1,  2,  3,  0,  4,  6,  5,  7, 12, 11, 10,  8,  9, 14, 19, 27, 24,
       13])

**range is from 0 - 27 vehicles**

In [412]:
data.number_of_vehicles_involved.isnull().sum()

0

**Also we have no null value**

### **number of fatalities Column**

In [413]:
#Checking for null values
data.number_of_fatalities.isnull().sum()

0

In [414]:
#checking for the range of number of fatalities
data.number_of_fatalities.unique()

array([ 0,  1,  3,  2,  5,  6,  9,  4, 10,  7, 11, 13,  8])

### **Number of injuries Column**

In [415]:
# Checking if there is a null value
data.number_of_injuries.isnull().sum()

0

In [416]:
#Checking the range
data.number_of_injuries.unique()

array([ 2,  0,  1,  5,  4,  3,  9,  7, 12, 14,  6,  8, 51, 13, 10, 11, 17,
       15, 35, 24, 20, 21, 30, 19, 28, 32, 18, 25, 22, 16, 23, 27, 43, 42,
       46, 31, 39, 49, 38, 44, 34])
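Reading a range off a long `unique()` array is error-prone. A minimal sketch of a more direct check is to aggregate the minimum and maximum of the three count columns in one call. The `sample` values below are hypothetical, chosen to include the extremes seen in the outputs above:

```python
import pandas as pd

# Hypothetical sample that includes the extreme values seen in the outputs above
sample = pd.DataFrame({
    'number_of_vehicles_involved': [1, 2, 0, 27],
    'number_of_fatalities': [0, 1, 13, 0],
    'number_of_injuries': [2, 0, 51, 1],
})

# One row per statistic, one column per count column
ranges = sample.agg(['min', 'max'])

print(ranges)
```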

### **Weather condition column**

Checking for the types of weather condition

In [417]:
data.weather_condition.unique()

array(['clear', 'foggy', 'dark', 'rainy', 'other', 'natural disaster',
       'land slide'], dtype=object)

In [418]:
# Checking if there are null values
data.weather_condition.isnull().sum()

0

### **Latitude and Longitude Column**

**Checking if there are null value in the latitude and longitude column**

In [419]:
data[['latitude', 'longitude']].isnull().sum()

latitude     359
longitude    359
dtype: int64

**And yes, there are null values in the columns**

**There are 359 missing values in each column**

In [420]:
# Getting the count of rows with missing latitude and longitude information
data.loc[(data['latitude'].isnull()) & (data['longitude'].isnull())].count()

acc_code                       359
Incident_date                  359
Incident_time                  359
report_date                    359
report_time                    359
province                       359
agency                         359
route                          359
vehicle_type                   359
presumed_cause                 359
accident_type                  359
number_of_vehicles_involved    359
number_of_fatalities           359
number_of_injuries             359
weather_condition              359
latitude                         0
longitude                        0
road_description               359
slope_description              359
dtype: int64

In [421]:
long_lat_null = data[['province','route','latitude', 'longitude']].loc[data['latitude'].isnull()]

In [422]:
long_lat_null['route'].describe()

count                            359
unique                            22
top       Chaeng Watthana-Phaya Thai
freq                              64
Name: route, dtype: object

**Chaeng Watthana-Phaya Thai** is the route that most often has null latitude and longitude values, occurring **64** times.

Also, **22** unique routes have missing longitude and latitude values.

Because the analysis is not focused on geographical location, and because these columns contain missing values, dropping the longitude and latitude columns is ideal.

In [423]:
data.drop('longitude', axis=1, inplace=True)
data.drop('latitude', axis=1, inplace=True)

In [424]:
data.head(3)

Unnamed: 0,acc_code,Incident_date,Incident_time,report_date,report_time,province,agency,route,vehicle_type,presumed_cause,accident_type,number_of_vehicles_involved,number_of_fatalities,number_of_injuries,weather_condition,road_description,slope_description
0,571905,2019-01-01,00:00:00,2019-01-02,06:11:00,Loburi,department of rural roads,Highway 21 (km 31+000) - Ban Wang Phloeng,motorcycle,driving under the influence of alcohol,other,1,0,2,clear,straight road,no slope
1,3790870,2019-01-01,00:03:00,2020-02-20,13:48:00,Ubon Ratchathani,department of highways,Dej Udom - Ubon Ratchathani,private/passenger car,speeding,rollover/fallen on straight road,1,0,2,clear,straight road,no slope
2,599075,2019-01-01,00:05:00,2019-01-01,10:35:00,Prachuap Khiri Khan,department of highways,Pranburi - Pak Nam Pran,motorcycle,speeding,head-on collision (not overtaking),2,1,0,clear,wide curve,slope area


**Done**

### **Road Description Column**

Checking the unique descriptions of the road

In [425]:
data['road_description'].unique()

array(['straight road', 'wide curve', 'other',
       'connecting to public/commercial area', 'sharp curve',
       'four-way intersection', 'connecting to private area',
       't-intersection', 'y-intersection',
       'grade-separated intersection/ramps', 'merge lane',
       'connecting to school area', 'lane-changing area', 'u-turn area',
       'roundabout', 'motorcycle lane', 'pedestrian path',
       'bridge (across river/canal)',
       'zebra crossing/pedestrian crossing'], dtype=object)

In [426]:
# how many unique descriptions do we have
data['road_description'].nunique()

19

We have **19** unique descriptions of the road

In [427]:
data['road_description'].isnull().sum()

0

There are no null values

**Done**

### **Slope Description Column**

Checking the unique descriptions of the slope

In [428]:
data['slope_description'].unique()

array(['no slope', 'slope area', 'other'], dtype=object)

There are just **3** main categories for the slope description

In [429]:
data['slope_description'].isnull().sum()

0

There are no null Values in the column.


**Done**
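The same two checks (number of unique values, number of nulls) were run separately for `road_description` and `slope_description`. As a rough sketch, they can be bundled into one small helper. The `sample` frame and the `summarize_columns` name below are illustrative assumptions, not part of the original notebook; on the real data you would pass `data` instead.

```python
import pandas as pd

# Toy stand-in for the cleaned accident data (values are illustrative only)
sample = pd.DataFrame({
    'road_description': ['straight road', 'wide curve', 'straight road', 'other'],
    'slope_description': ['no slope', 'slope area', 'no slope', 'no slope'],
})

def summarize_columns(df, columns):
    # One row per column: number of distinct values and number of nulls
    return pd.DataFrame({
        'unique_values': df[columns].nunique(),
        'null_count': df[columns].isnull().sum(),
    })

summary = summarize_columns(sample, ['road_description', 'slope_description'])
print(summary)
```
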

## **EXPLORATORY DATA ANALYSIS (EDA)**



**In this section, we'll analyse and explore each column of the dataset, as well as the insights we can derive from it**

### **Questions to be answered through exploratory data analysis**
**We pose some questions and try to answer them using the data.**

* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***Which month has the highest number of recorded accidents?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***Which year has the highest number of recorded accidents?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***Which time of day or season has the highest accident count?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***Which agency's roads have the highest number of accidents?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***What are the provinces with the highest number of accidents?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***How does the number of accidents vary with the time?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***What is the most dangerous route by province?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***What is the most frequent type of vehicle involved in accidents?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***Ranking of presumed causes ordered by number of accidents.***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***How do weather conditions affect accidents?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***Study of the effects of road_description and slope_description.***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***How long is the delay between incident and report dates?***
* <span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> ***What type of vehicle is associated with the highest number of casualties?***

In [430]:
# extracting the columns related to datetime from the dataset
time_data = data[['Incident_date','Incident_time','report_date','report_time']]

In [431]:
# preview the extracted columns
time_data.head(3)

Unnamed: 0,Incident_date,Incident_time,report_date,report_time
0,2019-01-01,00:00:00,2019-01-02,06:11:00
1,2019-01-01,00:03:00,2020-02-20,13:48:00
2,2019-01-01,00:05:00,2019-01-01,10:35:00


In [432]:
# extracting only the years
year = time_data['Incident_date'].str[:4]
year

0        2019
1        2019
2        2019
3        2019
4        2019
         ... 
81730    2022
81731    2022
81732    2022
81733    2022
81734    2022
Name: Incident_date, Length: 81735, dtype: object

In [433]:
year.unique()

array(['2019', '2020', '2021', '2022'], dtype=object)

<span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> **We can observe that the accident records span 2019 to 2022**
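String slicing works here because every `Incident_date` value follows the `YYYY-MM-DD` pattern. As an alternative sketch, the dates can be parsed with `pd.to_datetime` and the year read from the `.dt` accessor, which gives integers and raises an error on malformed dates. The `dates` series below is a made-up sample standing in for `data['Incident_date']`.

```python
import pandas as pd

# Illustrative dates in the same YYYY-MM-DD format as Incident_date
dates = pd.Series(['2019-01-01', '2020-02-20', '2022-12-31'], name='Incident_date')

# Parse once, then read the year as an integer
parsed = pd.to_datetime(dates, format='%Y-%m-%d')
year_int = parsed.dt.year
print(year_int.tolist())
```
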

In [434]:
# checking which year has the most recorded accidents
year_rec = year.value_counts().to_frame()
year_rec

Unnamed: 0,Incident_date
2020,21052
2022,21032
2021,20457
2019,19194


<span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> **From the analysis, 2020 has the highest number of recorded accidents: 21052 out of 81735**

**2019 has the lowest number of recorded accidents: 19194 out of 81735**
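To put the yearly counts in proportion, they can be expressed as shares of the total. The sketch below re-enters the four counts from the output above by hand; on the real data, `year.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Yearly counts copied from the value_counts() output above
year_counts = pd.Series({'2020': 21052, '2022': 21032, '2021': 20457, '2019': 19194})

# Share of all recorded accidents, as a percentage rounded to 2 decimals
year_share = (year_counts / year_counts.sum() * 100).round(2)
print(year_share)
```

The shares range from roughly 23% (2019) to roughly 26% (2020), so the difference between years is fairly small.
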

Extracting the month values from the dataset

In [435]:
#extracting the month from the dataset
incident_month = data['Incident_date'].str[5:7].astype(int)
incident_month.head()

0    1
1    1
2    1
3    1
4    1
Name: Incident_date, dtype: int64

Creating a dictionary that maps each month number to its name

In [436]:
month_dico = {1:'January', 2:'February', 3:'March', 4:'April', 5:'May', 6:'June', 7:'July',
              8:'August', 9:'September', 10:'October', 11:'November', 12:'December'}

replacing the values with the actual month name.

In [437]:
# replacing the values with the month name
incident_month.replace(month_dico, inplace=True)

#Checking if the change is applied
incident_month.head()

0    January
1    January
2    January
3    January
4    January
Name: Incident_date, dtype: object
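The dictionary approach above is explicit and easy to follow. As an alternative sketch, pandas can produce the month names directly from parsed dates with `.dt.month_name()`, which avoids maintaining the mapping by hand. The `dates` series below is a made-up sample standing in for `data['Incident_date']`, and the default (English) locale is assumed.

```python
import pandas as pd

# Illustrative dates in the same YYYY-MM-DD format as Incident_date
dates = pd.Series(['2019-01-01', '2019-04-13', '2019-12-31'], name='Incident_date')

# Parse the strings, then ask pandas for the month name
month_names = pd.to_datetime(dates, format='%Y-%m-%d').dt.month_name()
print(month_names.tolist())
```
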

Checking which month has the highest number of recorded accidents.
This can point us towards why accidents are more frequent in that particular month.

In [438]:
month_rec = incident_month.value_counts().to_frame()
month_rec

Unnamed: 0,Incident_date
December,9657
April,9434
January,8516
March,6503
October,6332
July,6282
November,6133
February,6030
September,5856
May,5794


<span style='color: lightgreen ; font-size: 1.5em ; font-weight: bold'>&#10003;</span> **December is the month with the most recorded accidents: 9657 out of 81735**

**June is the month with the fewest recorded accidents: 5411 out of 81735**

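Thailand's year is commonly divided into three seasons:

* the wet season (May to October),
* the cool season (November to February),
* and the hot season (March to May).

Because May appears in both the wet and hot ranges, the grouping below assigns May to the hot season, so the wet season here runs from June to October.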

In [439]:
Wet = ['June', 'July', 'August', 'September', 'October']
Cool = ['November', 'December', 'January', 'February']
Hot = ['March', 'April', 'May']

In [440]:
def season(x):
  # Map a month name to its season label (returns None for an unknown month)
  if x in Wet:
    return 'Wet season'
  if x in Cool:
    return 'Cool Season'
  if x in Hot:
    return 'Hot season'


In [441]:
month_season = incident_month.apply(season)
month_season

0        Cool Season
1        Cool Season
2        Cool Season
3        Cool Season
4        Cool Season
            ...     
81730    Cool Season
81731    Cool Season
81732    Cool Season
81733    Cool Season
81734    Cool Season
Name: Incident_date, Length: 81735, dtype: object

In [442]:
month_season.unique()

array(['Cool Season', 'Hot season', 'Wet season'], dtype=object)

In [443]:
month_season.value_counts()

Cool Season    30336
Wet season     29668
Hot season     21731
Name: Incident_date, dtype: int64
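The three seasons cover different numbers of months (5 wet, 4 cool, 3 hot under the grouping above), so the raw totals are not directly comparable. A minimal sketch of one way to adjust for this is to divide each total by the number of months in the season. The counts are re-entered by hand from the output above, and the resulting figures are per calendar month summed over the whole 2019 to 2022 period, not per individual month.

```python
import pandas as pd

# Season totals copied from the value_counts() output above
season_counts = pd.Series({'Cool Season': 30336, 'Wet season': 29668, 'Hot season': 21731})

# Number of months in each season, following the Wet/Cool/Hot lists above
months_per_season = pd.Series({'Cool Season': 4, 'Wet season': 5, 'Hot season': 3})

# Average accidents per calendar month within each season
avg_per_month = (season_counts / months_per_season).round(1)
print(avg_per_month.sort_values(ascending=False))
```

On this basis the hot season, which has the lowest raw total, averages more accidents per month than the wet season.
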

## **DATA VISUALIZATION**