# ANALYZING RTC SEVERITY DATASET
In this project, we analyze road accident data to answer the following questions:
1. Which Age_of_Vehicle has the most/least accidents?
2. Which Age_Band_of_Driver has recorded the most/least accidents?
3. Which make had the most/least accidents?
4. Which model has the most accidents?
5. Which propulsion codes have the most/least accidents?
6. Are specific Vehicle_Types prone to accidents?
7. How many accidents are recorded in different Driver_Home_Area_Types?
8. Which speed limit is associated with the most accidents?
9. How many accidents involve female/male drivers?

In [1]:
import csv
import datetime as dt
import more_itertools
import locale
import chardet

In [2]:
# detect the encoding from the first ten bytes of the file (a very small sample)
with open("C://Users//user//I am learning ML//Analyzing RTC Severity Dataset//vehicle_data.csv", mode='rb') as file:
    raw_bytes = file.read(10)
    detected_encoding = chardet.detect(raw_bytes)['encoding']
    print(detected_encoding)

ascii


In [3]:
print(locale.getpreferredencoding())

cp1252


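**cp1252** is the locale's preferred encoding, which **open()** falls back to when no encoding argument is given. **chardet** reported **ascii**, but only from the first ten bytes of the file; ASCII text is also valid cp1252 and valid UTF-8.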

# **CONVERT THE FILE FROM *ASCII* ENCODING TO *UTF-8* ENCODING**

## To convert a csv file from one encoding to another:
1. Open the file using its current encoding.
2. Read the file using **csv.reader()**
3. Open the new file using the desired encoding.
4. Loop over the rows of the original file and write them into the new one using **csv.writer()** and the **writerow()** method.
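
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: the function name `convert_csv_encoding` and the default encodings are assumptions chosen for the example.

```python
import csv

def convert_csv_encoding(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Copy a CSV file row by row, re-encoding it from src_encoding to dst_encoding."""
    # newline="" lets the csv module handle line endings itself
    with open(src_path, mode="r", encoding=src_encoding, newline="") as src, \
         open(dst_path, mode="w", encoding=dst_encoding, newline="") as dst:
        reader = csv.reader(src)   # step 2: read with csv.reader()
        writer = csv.writer(dst)   # step 3: new file opened with the desired encoding
        rows_written = 0
        for row in reader:         # step 4: copy every row
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

A call such as `convert_csv_encoding("vehicle_data.csv", "vehicle_data_utf8.csv")` would write a UTF-8 copy next to the original; the output file name here is hypothetical.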

In [4]:
with open ("C://Users//user//I am learning ML//Analyzing RTC Severity Dataset//vehicle_data.csv") as file:
    rows = list(csv.reader(file))

In [5]:
print(rows[:10])

[['Accident_Index', 'Age_Band_of_Driver', 'Age_of_Vehicle', 'Driver_Home_Area_Type', 'Driver_IMD_Decile', 'Engine_Capacity_.CC.', 'Hit_Object_in_Carriageway', 'Hit_Object_off_Carriageway', 'Journey_Purpose_of_Driver', 'Junction_Location', 'make', 'model', 'Propulsion_Code', 'Sex_of_Driver', 'Skidding_and_Overturning', 'Towing_and_Articulation', 'Vehicle_Leaving_Carriageway', 'Vehicle_Location.Restricted_Lane', 'Vehicle_Manoeuvre', 'Vehicle_Reference', 'Vehicle_Type', 'Was_Vehicle_Left_Hand_Drive', 'X1st_Point_of_Impact', 'Year'], ['200401BS00001', '26 - 35', '3', 'Urban area', '4', '1588', 'None', 'None', 'Data missing or out of range', 'Data missing or out of range', 'ROVER', '45 CLASSIC 16V', 'Petrol', 'Male', 'None', 'No tow/articulation', 'Did not leave carriageway', '0', 'Going ahead other', '2', '109', 'Data missing or out of range', 'Front', '2004'], ['200401BS00002', '26 - 35', 'NA', 'Urban area', '3', 'NA', 'None', 'None', 'Data missing or out of range', 'Data missing or out o

In [6]:
header = rows[0]
print(len(header))

24


In [7]:
data = rows[1:]
print(len(data[0]))

24


# EXPLORE THE DATASET

In [8]:
# function to explore the dataset to find out how many rows and columns there are
def explore_dataset(dataset, start, end, rows_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
    if rows_columns:
        print("Number of rows",len(dataset))
        print("Number of columns", len(dataset[0]))

In [9]:
explore_dataset(data, 0,5,True)

['200401BS00001', '26 - 35', '3', 'Urban area', '4', '1588', 'None', 'None', 'Data missing or out of range', 'Data missing or out of range', 'ROVER', '45 CLASSIC 16V', 'Petrol', 'Male', 'None', 'No tow/articulation', 'Did not leave carriageway', '0', 'Going ahead other', '2', '109', 'Data missing or out of range', 'Front', '2004']


['200401BS00002', '26 - 35', 'NA', 'Urban area', '3', 'NA', 'None', 'None', 'Data missing or out of range', 'Data missing or out of range', 'BMW', 'C1', 'NA', 'Male', 'None', 'No tow/articulation', 'Did not leave carriageway', '0', 'Going ahead other', '1', '109', 'Data missing or out of range', 'Front', '2004']


['200401BS00003', '26 - 35', '4', 'Data missing or out of range', 'NA', '998', 'None', 'None', 'Data missing or out of range', 'Data missing or out of range', 'NISSAN', 'MICRA CELEBRATION 16V', 'Petrol', 'Male', 'None', 'No tow/articulation', 'Did not leave carriageway', '0', 'Turning right', '1', '109', 'Data missing or out of range', 'Front', '2

# CHECK FOR WRONG DATA

In [10]:
# Remove empty lists
data = [sublist for sublist in data if sublist]
explore_dataset(data, 0,5,True)

['200401BS00001', '26 - 35', '3', 'Urban area', '4', '1588', 'None', 'None', 'Data missing or out of range', 'Data missing or out of range', 'ROVER', '45 CLASSIC 16V', 'Petrol', 'Male', 'None', 'No tow/articulation', 'Did not leave carriageway', '0', 'Going ahead other', '2', '109', 'Data missing or out of range', 'Front', '2004']


['200401BS00002', '26 - 35', 'NA', 'Urban area', '3', 'NA', 'None', 'None', 'Data missing or out of range', 'Data missing or out of range', 'BMW', 'C1', 'NA', 'Male', 'None', 'No tow/articulation', 'Did not leave carriageway', '0', 'Going ahead other', '1', '109', 'Data missing or out of range', 'Front', '2004']


['200401BS00003', '26 - 35', '4', 'Data missing or out of range', 'NA', '998', 'None', 'None', 'Data missing or out of range', 'Data missing or out of range', 'NISSAN', 'MICRA CELEBRATION 16V', 'Petrol', 'Male', 'None', 'No tow/articulation', 'Did not leave carriageway', '0', 'Turning right', '1', '109', 'Data missing or out of range', 'Front', '2

In [11]:
print(len(data[0]))

24


# CHECK FOR DUPLICATES

In [12]:
# check for duplicates by Accident_Index
duplicate_entries = []
unique_entries = []
for row in data:
    accident_id = row[0]
    if accident_id in unique_entries:
        duplicate_entries.append(accident_id)
#     else:
#         unique_entries.append(accident_id)
# NOTE: with the else branch above commented out, unique_entries stays empty,
# so the membership test never succeeds and the count below is always 0.
len_duplicate_entries = len(duplicate_entries)
print("There are", len_duplicate_entries, "duplicate entries")

There are 0 duplicate entries
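
The sample rows printed elsewhere in this notebook show the same Accident_Index (for example 200401BS00003) on more than one row, which suggests that one accident can have several vehicle records. A repeated index is therefore not necessarily a duplicate row. The sketch below is one possible way to check both cases with `collections.Counter`, which avoids the slow list membership test on a large dataset. The helper `find_duplicates` and the toy `sample` rows are assumptions made for illustration.

```python
from collections import Counter

def find_duplicates(rows, key_index=None):
    """Return {key: count} for keys that occur more than once.

    key_index=None compares whole rows; an integer compares a single column.
    """
    if key_index is None:
        keys = (tuple(row) for row in rows)   # lists are unhashable, tuples are not
    else:
        keys = (row[key_index] for row in rows)
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}

# toy rows: [accident id, vehicle type, year]
sample = [
    ["A1", "Car", "2004"],
    ["A2", "Van", "2004"],
    ["A2", "Car", "2004"],   # same accident id, different vehicle
    ["A1", "Car", "2004"],   # exact copy of the first row
]
print(find_duplicates(sample, key_index=0))  # repeated accident ids
print(find_duplicates(sample))               # fully identical rows
```

On the real data, `find_duplicates(data, key_index=0)` would list repeated accident IDs, while `find_duplicates(data)` would list only rows that are identical in every column.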


# REPLACE MISSING STRINGS WITH "UNKNOWN DATA"

In [13]:
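# title-case every value in column i and replace empty strings with "Unknown Data"
# (non-empty markers such as 'NA' are not treated as missing; they become 'Na')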
def fill_missing_strings(i):
    for row in data:
        col = row[i]
        col = col.title()
        if not col:
            col = "Unknown Data"
        row[i] = col

In [14]:
for i in range(len(header)):
    fill_missing_strings(i)
print(data[2:5])

[['200401Bs00003', '26 - 35', '4', 'Data Missing Or Out Of Range', 'Na', '998', 'None', 'None', 'Data Missing Or Out Of Range', 'Data Missing Or Out Of Range', 'Nissan', 'Micra Celebration 16V', 'Petrol', 'Male', 'None', 'No Tow/Articulation', 'Did Not Leave Carriageway', '0', 'Turning Right', '1', '109', 'Data Missing Or Out Of Range', 'Front', '2004'], ['200401Bs00003', '66 - 75', 'Na', 'Data Missing Or Out Of Range', 'Na', 'Na', 'None', 'None', 'Data Missing Or Out Of Range', 'Data Missing Or Out Of Range', 'London Taxis Int', 'Txii Gold Auto', 'Na', 'Male', 'None', 'No Tow/Articulation', 'Did Not Leave Carriageway', '0', 'Going Ahead Other', '2', '109', 'Data Missing Or Out Of Range', 'Front', '2004'], ['200401Bs00004', '26 - 35', '1', 'Urban Area', '4', '124', 'None', 'None', 'Data Missing Or Out Of Range', 'Data Missing Or Out Of Range', 'Piaggio', 'Vespa Et4', 'Petrol', 'Male', 'None', 'No Tow/Articulation', 'Did Not Leave Carriageway', '0', 'Going Ahead Other', '1', 'Motorcycle
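
The output above still contains 'Na' values, because only empty strings are replaced. If 'NA' should also count as missing, a variant along these lines could be used. This is a sketch under that assumption: `MISSING_MARKERS`, `clean_cell` and `clean_rows` are hypothetical names, and unlike `fill_missing_strings` it returns new rows instead of modifying `data` in place.

```python
# assumption: both the empty string and 'NA' mean "no value recorded"
MISSING_MARKERS = {"", "NA"}

def clean_cell(value, placeholder="Unknown Data"):
    """Title-case a cell, or return the placeholder if the cell is a missing marker."""
    text = value.strip()
    if text.upper() in MISSING_MARKERS:
        return placeholder
    return text.title()

def clean_rows(rows):
    """Return a cleaned copy of rows without mutating the input."""
    return [[clean_cell(cell) for cell in row] for row in rows]

sample = [["200401BS00002", "26 - 35", "NA", "", "BMW"]]
print(clean_rows(sample))
```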

In [15]:
# map each column name to its index
col_index = {}
for i in range(len(header)):
    col_index[header[i]] = i
print(col_index)

{'Accident_Index': 0, 'Age_Band_of_Driver': 1, 'Age_of_Vehicle': 2, 'Driver_Home_Area_Type': 3, 'Driver_IMD_Decile': 4, 'Engine_Capacity_.CC.': 5, 'Hit_Object_in_Carriageway': 6, 'Hit_Object_off_Carriageway': 7, 'Journey_Purpose_of_Driver': 8, 'Junction_Location': 9, 'make': 10, 'model': 11, 'Propulsion_Code': 12, 'Sex_of_Driver': 13, 'Skidding_and_Overturning': 14, 'Towing_and_Articulation': 15, 'Vehicle_Leaving_Carriageway': 16, 'Vehicle_Location.Restricted_Lane': 17, 'Vehicle_Manoeuvre': 18, 'Vehicle_Reference': 19, 'Vehicle_Type': 20, 'Was_Vehicle_Left_Hand_Drive': 21, 'X1st_Point_of_Impact': 22, 'Year': 23}


In [16]:
# check the type of the Year value (column 23) in the last row
myyear = data[-1][23]
print(type(myyear))

<class 'str'>


# PARSE YEAR AS YEAR

In [17]:
# # parse strings as dates
# for row in data:
#     myyear = row[23]
#     myyear = dt.datetime.strptime(myyear, "%Y").year
#     row[23] = myyear
# print(type(row[23]))
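
The commented-out cell above parses each Year string with `strptime`. A value that is not a four-digit year (for example 'Na' or 'Unknown Data') would raise a `ValueError` there, so a defensive version might look like the sketch below. The helper name `parse_year` and the choice to return `None` for unparseable values are assumptions, not part of the original notebook.

```python
import datetime as dt

def parse_year(text):
    """Return the year as an int, or None if text is not a four-digit year."""
    try:
        return dt.datetime.strptime(text.strip(), "%Y").year
    except ValueError:
        return None

print(parse_year("2004"))
print(parse_year("Na"))
```

Applied to the dataset, this could be used as `row[23] = parse_year(row[23])` inside a loop over `data`.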

In [18]:
print(data[100:104])

[['200401Bs00195', '56 - 65', '6', 'Urban Area', '6', '2435', 'None', 'None', 'Data Missing Or Out Of Range', 'Data Missing Or Out Of Range', 'Volvo', 'V70 Xlt 20V Auto', 'Petrol', 'Female', 'None', 'No Tow/Articulation', 'Did Not Leave Carriageway', '0', 'Going Ahead Other', '1', '109', 'Data Missing Or Out Of Range', 'Front', '2004'], ['200401Bs00197', '36 - 45', '3', 'Urban Area', '6', '599', 'Kerb', 'None', 'Data Missing Or Out Of Range', 'Data Missing Or Out Of Range', 'Kawasaki', 'Unknown Data', 'Petrol', 'Male', 'Skidded', 'No Tow/Articulation', 'Offside', '0', 'Going Ahead Other', '1', '106', 'Data Missing Or Out Of Range', 'Front', '2004'], ['200401Bs00197', '26 - 35', '8', 'Urban Area', '3', '2316', 'None', 'None', 'Data Missing Or Out Of Range', 'Data Missing Or Out Of Range', 'Volvo', '940 Gle Turbo Auto', 'Petrol', 'Female', 'None', 'No Tow/Articulation', 'Did Not Leave Carriageway', '0', 'Turning Right', '2', '109', 'Data Missing Or Out Of Range', 'Offside', '2004'], ['20

# ANALYSIS 

# 1. HOW OLD ARE THE VEHICLES THAT GET INTO ACCIDENTS?

In [19]:
# map each column name to its index
col_index = {}
for i in range(len(header)):
    col_index[header[i]] = i
print(col_index)

{'Accident_Index': 0, 'Age_Band_of_Driver': 1, 'Age_of_Vehicle': 2, 'Driver_Home_Area_Type': 3, 'Driver_IMD_Decile': 4, 'Engine_Capacity_.CC.': 5, 'Hit_Object_in_Carriageway': 6, 'Hit_Object_off_Carriageway': 7, 'Journey_Purpose_of_Driver': 8, 'Junction_Location': 9, 'make': 10, 'model': 11, 'Propulsion_Code': 12, 'Sex_of_Driver': 13, 'Skidding_and_Overturning': 14, 'Towing_and_Articulation': 15, 'Vehicle_Leaving_Carriageway': 16, 'Vehicle_Location.Restricted_Lane': 17, 'Vehicle_Manoeuvre': 18, 'Vehicle_Reference': 19, 'Vehicle_Type': 20, 'Was_Vehicle_Left_Hand_Drive': 21, 'X1st_Point_of_Impact': 22, 'Year': 23}


In [22]:
def how_many(dataset):
    # collect the four columns of interest from every row of the dataset passed in
    result_list = []
    for row in dataset:
        age_driver = row[1]
        age_vehicle = str(row[2])
        vehicle_type = row[20]
        year = row[-1]
        result_list.append([age_driver,age_vehicle,vehicle_type,year])

    driver_age = {}
    vehicle_age = {}
    vehicle = {}
    my_year = {}
    for result in result_list:
        age_d = result[0]
        age_v = result[1]
        vehicle_t = result[2]
        y = result[3]
 
        if age_d in driver_age:
            driver_age[age_d] += 1
        else:
            driver_age[age_d] = 1
        if age_v in vehicle_age:
            vehicle_age[age_v] += 1
        else:
            vehicle_age[age_v] = 1
        if vehicle_t in vehicle:
            vehicle[vehicle_t] += 1
        else:
            vehicle[vehicle_t] = 1
        if y in my_year:
            my_year[y] += 1
        else:
            my_year[y] = 1
            
    return driver_age, vehicle_age, vehicle, my_year
driver_age, vehicle_age, vehicle, my_year = how_many(data)

def print_first_few_data(dictionary):
    # first 10 key:value pairs, in insertion order
    first_few = more_itertools.take(10, dictionary.items())
    return first_few

def sorted_values(dictionary):
    # sort the dictionary by count in descending order
    # to see which categories have the most accidents, then keep the top 10
    sorted_dict = dict(sorted(dictionary.items(), reverse=True, key=lambda item: item[1]))
    sorted_dict1 = print_first_few_data(sorted_dict)
    return sorted_dict1

In [23]:
print("Accidents from age bands: \n ", print_first_few_data(driver_age))
print("Accidents from age of vehicles: \n ", print_first_few_data(vehicle_age))
print("The first ten vehicles: \n ", print_first_few_data(vehicle))
print("The first ten accidents per year are: \n ", print_first_few_data(my_year))
print("Accidents from age bands in descending order:\n", sorted_values(driver_age ))
print("Accidents from age of vehicles in descending order:\n", sorted_values(vehicle_age))
print("The first ten vehicle types in descending order:\n", sorted_values(vehicle))
print("The sorted number of accidents per year are: \n ", sorted_values(my_year))

Accidents from age bands: 
  [('26 - 35', 450531), ('66 - 75', 91454), ('36 - 45', 435686), ('46 - 55', 348762), ('21 - 25', 238765), ('Data Missing Or Out Of Range', 171052), ('16 - 20', 175874), ('56 - 65', 206181), ('Over 75', 54236), ('11 - 15', 3655)]
Accidents from age of vehicles: 
  [('3', 148665), ('Na', 358149), ('4', 144493), ('1', 180333), ('10', 113461), ('2', 161072), ('11', 100439), ('6', 134524), ('9', 121279), ('5', 138464)]
The first ten vehicles: 
  [('109', 82920), ('Motorcycle 125Cc And Under', 61600), ('Van / Goods 3.5 Tonnes Mgw Or Under', 117427), ('Bus Or Coach (17 Or More Pass Seats)', 76757), ('Goods 7.5 Tonnes Mgw And Over', 55426), ('108', 1334), ('Motorcycle 50Cc And Under', 22415), ('106', 7568), ('Other Vehicle', 13994), ('Goods Over 3.5T. And Under 7.5T', 18236)]
The first ten accidents per year are: 
  [('2004', 118797), ('2005', 112288), ('2006', 115017), ('2007', 127172), ('2008', 122445), ('2009', 182321), ('2010', 180367), ('2011', 180616), ('2012'
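
The tallying done in `how_many` can also be expressed with `collections.Counter` from the standard library, whose `most_common()` method returns the categories sorted by count. The sketch below shows the idea on a toy sample; `count_column` and the sample rows are hypothetical and only illustrate the technique.

```python
from collections import Counter

def count_column(rows, index):
    """Count how often each value appears in the given column."""
    return Counter(row[index] for row in rows)

# toy rows: [age band of driver, age of vehicle]
sample = [
    ["26 - 35", "3"],
    ["26 - 35", "1"],
    ["36 - 45", "1"],
    ["26 - 35", "Na"],
    ["36 - 45", "2"],
    ["66 - 75", "1"],
]
age_band_counts = count_column(sample, 0)
most = age_band_counts.most_common(1)[0]    # category with the highest count
least = age_band_counts.most_common()[-1]   # category with the lowest count
print("Most:", most)
print("Least:", least)
```

With the real data, `count_column(data, col_index['Age_Band_of_Driver'])` would answer the most/least question for driver age bands, and the same call with another column name covers the other questions.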

# RESULTS

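1. Newer vehicles appear in more accident records: among the vehicle ages shown above, 1-year-old vehicles have the most (180,333) and the counts fall as age increases. Note that 358,149 records have no vehicle age ('Na').
2. Cars are the vehicle type involved in the most accidents.
3. 2015 had the most accidents.
4. Most recorded accidents involved drivers aged between 26 and 35 (450,531 records).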