# Road accidents dataset

This project creates a synthetic dataset of 100 rows. Every column is categorical except age, which is numeric. The dataset is intended for training and testing a machine learning algorithm, so the data is generated in a way that lets a model predict road accidents from the factors that contribute to them.

This project consists of 3 sections:
- Data generation
- Data conversion to binary form
- Creating the crash test data

## Section 1 - Data generation
In this section the data is generated with the intended proportions for each column, collected into a pandas DataFrame, and exported as a .csv file.

In [1]:
import numpy as np
import pandas as pd
import random
random.seed(1)# seeded so the generated values are reproducible
np.random.seed(1)

genderList = ["M","F"]
finalGenderList = [] #/column/
ageList = []
finalAgeList = [] #/column/
alcoholList = ["y","n"]
finalAlcoholList = []# /column/
fatiguedList = ["y","n"]
finalFatiguedList = []# /column/
timeOfDayList = ["Day","Night"]
finalTimeOfDayList = []# /column/
rainingList = ["y","n"]
finalRainingList = []# /column/

for i in range(100):# a for loop to create 100 rows for database
    #_______Gender_______
    gender = random.choice(genderList)# instantiate the gender variable
    finalGenderList.append(gender)# populate the final gender list
    
    #______Age___________
    age = random.uniform(18,75)
    finalAgeList.append(int(age))
    
    #______Alcohol_______
    # weights set the proportion of each value in the column
    # note: random.choices returns a one-element list, so each cell is stored as ['y'] or ['n']
    alcohol = random.choices(alcoholList, weights=[0.4,0.6])
    finalAlcoholList.append(alcohol)
    
    #______Fatigue_______
    fatigue = random.choices(fatiguedList, weights=[0.15, 0.85])
    finalFatiguedList.append(fatigue)
    
    #______Time of Day___
    timeOfDay = random.choices(timeOfDayList, weights=[0.5, 0.5])
    finalTimeOfDayList.append(timeOfDay)
    
    #______Rain__________
    rain = random.choices(rainingList, weights=[0.3, 0.7])
    finalRainingList.append(rain)
    

# convert the above lists to pandas series
genderSeries = pd.Series(finalGenderList)
ageSeries = pd.Series(finalAgeList)
alcoholSeries = pd.Series(finalAlcoholList)
fatiguedSeries = pd.Series(finalFatiguedList)
timeOfDaySeries = pd.Series(finalTimeOfDayList)
rainSeries = pd.Series(finalRainingList)

seriesList = [genderSeries, ageSeries, alcoholSeries, fatiguedSeries, 
              timeOfDaySeries, rainSeries]
df = pd.DataFrame(data=seriesList).T
df.rename({0:"Gender",1:"Age",2:"Alcohol Consumed",3:"Fatigued",4:"Time of Day",5:"Raining"}, axis=1, inplace= True)
df.to_csv("DrivingDatabase1.csv", index = False)
df.head(10)# display the first 10 rows of the categorical dataset

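| | Gender | Age | Alcohol Consumed | Fatigued | Time of Day | Raining |
|---|---|---|---|---|---|---|
| 0 | M | 50 | [n] | [y] | [Day] | [n] |
| 1 | F | 55 | [n] | [y] | [Day] | [n] |
| 2 | F | 52 | [n] | [n] | [Day] | [n] |
| 3 | M | 69 | [y] | [y] | [Night] | [n] |
| 4 | F | 57 | [n] | [n] | [Night] | [n] |
| 5 | F | 49 | [y] | [n] | [Night] | [n] |
| 6 | M | 41 | [n] | [n] | [Day] | [n] |
| 7 | F | 24 | [y] | [n] | [Night] | [n] |
| 8 | F | 46 | [n] | [n] | [Day] | [n] |
| 9 | F | 66 | [n] | [n] | [Day] | [y] |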
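
The bracketed values in the output come from `random.choices`, which always returns a list. The following is a minimal sketch, not the notebook's method, of one way to generate the same columns with plain string labels. The helper `weighted_column`, the constant `N_ROWS`, and the use of `random.randint(18, 74)` in place of `int(random.uniform(18, 75))` are assumptions made for illustration. Because the columns are drawn in a different order, the values will not match the table above even with the same seed.

```python
import random

import pandas as pd

random.seed(1)  # seeded so the sketch is repeatable

N_ROWS = 100  # same row count as the notebook


def weighted_column(options, weights, n=N_ROWS):
    """Draw n scalar labels from options in the given proportions."""
    # random.choices returns a list of k draws, so one call fills a whole column
    return random.choices(options, weights=weights, k=n)


# Building the DataFrame from a dict names the columns directly,
# so no transpose or rename step is needed.
sample = pd.DataFrame({
    "Gender": [random.choice(["M", "F"]) for _ in range(N_ROWS)],
    "Age": [random.randint(18, 74) for _ in range(N_ROWS)],
    "Alcohol Consumed": weighted_column(["y", "n"], [0.4, 0.6]),
    "Fatigued": weighted_column(["y", "n"], [0.15, 0.85]),
    "Time of Day": weighted_column(["Day", "Night"], [0.5, 0.5]),
    "Raining": weighted_column(["y", "n"], [0.3, 0.7]),
})

# Share of each label, to compare against the requested weights
print(sample["Alcohol Consumed"].value_counts(normalize=True))
```

With only 100 rows the observed shares will drift from the requested weights, so the `value_counts` check is a sanity check rather than an exact match.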


## Section 2 - Data conversion to binary form 
In this section the data above is read back into a pandas DataFrame, converted to binary form one column at a time, and exported to a new .csv file.

In [2]:
df = pd.read_csv("DrivingDatabase1.csv")# read the database csv file

genderDummy = pd.get_dummies(df["Gender"])# create dummy columns converting the genders to binary values. M=1
df2 = pd.concat((df, genderDummy), axis=1)# concatenate the binary columns to a copy of the table
df2 = df2.drop(["Gender"], axis=1)# drop the original gender column with M/F values
df2 = df2.drop(["F"], axis = 1)# drop the female binary gender column
df2 = df2.rename(columns={"M":"Gender"})# rename the male binary column to gender

alcoholDummy = pd.get_dummies(df["Alcohol Consumed"])# yes=1; the labels read back from the csv are the strings "['y']" and "['n']"
df2 = pd.concat((df2, alcoholDummy), axis=1)
df2 = df2.drop(["Alcohol Consumed"], axis=1)
df2 = df2.drop(["['n']"], axis=1)
df2 = df2.rename(columns={"['y']":"Alcohol consumed"})

fatiguedDummy = pd.get_dummies(df["Fatigued"])# yes=1
df2 = pd.concat((df2, fatiguedDummy), axis=1)
df2 = df2.drop(["Fatigued"], axis=1)
df2 = df2.drop(["['n']"], axis=1)
df2 = df2.rename(columns={"['y']":"Fatigued"})

timeOfDayDummy = pd.get_dummies(df["Time of Day"])# night=1
df2 = pd.concat((df2, timeOfDayDummy), axis=1)
df2 = df2.drop(["Time of Day"], axis=1)
df2 = df2.drop(["['Day']"], axis=1)
df2 = df2.rename(columns={"['Night']":"Time of Day"})

rainDummy = pd.get_dummies(df["Raining"])# raining=1
df2 = pd.concat((df2, rainDummy), axis=1)
df2 = df2.drop(["Raining"], axis=1)
df2 = df2.drop(["['n']"], axis=1)
df2 = df2.rename(columns={"['y']":"Raining"})

df2.to_csv("BinaryDrivingDatabase1.csv", index=False)# export df2 to a csv file without the index
df2.head(10)# display the converted data

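| | Age | Gender | Alcohol consumed | Fatigued | Time of Day | Raining |
|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 0 | 1 | 0 | 0 |
| 1 | 55 | 0 | 0 | 1 | 0 | 0 |
| 2 | 52 | 0 | 0 | 0 | 0 | 0 |
| 3 | 69 | 1 | 1 | 1 | 1 | 0 |
| 4 | 57 | 0 | 0 | 0 | 1 | 0 |
| 5 | 49 | 0 | 1 | 0 | 1 | 0 |
| 6 | 41 | 1 | 0 | 0 | 0 | 0 |
| 7 | 24 | 0 | 1 | 0 | 1 | 0 |
| 8 | 46 | 0 | 0 | 0 | 0 | 0 |
| 9 | 66 | 0 | 0 | 0 | 0 | 1 |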
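
The same encoding can be sketched more compactly by comparing each column against the label that should become 1. This is an illustrative alternative, not the notebook's code: the three-row `sample` frame and the `POSITIVE_LABEL` mapping are hypothetical, and the `str.strip` call is there only to remove the list-style brackets and quotes that appear in `DrivingDatabase1.csv`.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the strings read from DrivingDatabase1.csv
sample = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Age": [50, 55, 52],
    "Alcohol Consumed": ["['n']", "['y']", "['n']"],
    "Fatigued": ["['y']", "['y']", "['n']"],
    "Time of Day": ["['Day']", "['Night']", "['Day']"],
    "Raining": ["['n']", "['n']", "['y']"],
})

# Label that is encoded as 1 in each column (same choices as the cell above)
POSITIVE_LABEL = {
    "Gender": "M",
    "Alcohol Consumed": "y",
    "Fatigued": "y",
    "Time of Day": "Night",
    "Raining": "y",
}

encoded = sample.copy()
for column, positive in POSITIVE_LABEL.items():
    # Strip the brackets and quotes left over from the list repr, e.g. "['y']" -> "y"
    cleaned = sample[column].astype(str).str.strip("[]'")
    # True where the label matches, converted to 1/0
    encoded[column] = (cleaned == positive).astype(int)

print(encoded)
```

Unlike the cell above, this sketch keeps the original column order (Gender first, then Age) and the original column names. Section 3 addresses columns by position, so the order would need to match before the result could be used there.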


## Section 3 - Creating the crash test data
In this section the binary data created in the previous section is read into a pandas DataFrame and converted to a numpy array. Each row is then run through a series of conditional statements that describe conditions prone to cause road accidents, producing a crash label for each row (1 = crash, 0 = no crash). These labels are then exported to a .csv dataset of their own that will be used to test the ML model.

In [3]:
df1 = pd.read_csv("BinaryDrivingDatabase1.csv")
crashList = []# list of crashes

dataArray = df1.to_numpy()# convert dataframe to numpy array

counter = 0# keeps track of number of rows read
counter2 = 0# to keep track of nr of crashes

for line in dataArray:# iterate through each line in the array
    # supply a number of conditional statements to create crash scenarios in the data
    if line[2] == 1 and line[3] == 1 and line[5] == 1:# DUI, fatigued and raining
        crashList.append(1)
        counter2 +=1# increment the counter to keep track of number of crashes
    
    elif line[0] == 71:# age 
        crashList.append(1)# append the case to the crash list
        counter2 +=1# increment the counter keeping track of crashes
    
    elif line[0] == 19 and line[2] == 1:# age and DUI 
        crashList.append(1)
        counter2 +=1
    
    elif line[1] == 0 and line[4] == 1 and line[5] == 1:# gender, time of day and raining
        crashList.append(1)
        counter2 +=1
    
    elif line[0] < 26 and line[2] == 1 and line[3] == 1:# age, DUI and fatigued
        crashList.append(1)
        counter2 +=1
    
    elif line[0] > 63 and line[2] == 1 and line[3] == 1:# age, DUI and fatigued
        crashList.append(1)
        counter2 +=1
    else:
        crashList.append(0)# no crash condition matched
    
    counter +=1# increment nr of rows read

#print(crashList, counter2)
crashSeries = pd.Series(crashList)# convert the crash list to a pandas series
df2 = pd.DataFrame(data=crashSeries)# convert the series to a dataframe
df2.to_csv("MachineLearningProj1ACrashTestData.csv", index=False)# export the data in .csv format
df2.head(10)# display the crash test data

| | 0 |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
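
The loop in the cell above addresses columns by position (`line[0]` is Age, `line[1]` is Gender, and so on), and every branch that matches appends 1, so the chain behaves like a logical OR of the rules. As a hedged sketch of the same rules, the version below uses named boolean masks instead of positions. The seven-row `binary` frame is hypothetical test data in the column layout of `BinaryDrivingDatabase1.csv`, and the `Crash` column name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows in the column layout of BinaryDrivingDatabase1.csv
binary = pd.DataFrame({
    "Age": [50, 69, 71, 19, 66, 24, 30],
    "Gender": [1, 1, 0, 1, 0, 0, 0],
    "Alcohol consumed": [0, 1, 0, 1, 0, 1, 1],
    "Fatigued": [1, 1, 0, 0, 0, 1, 0],
    "Time of Day": [0, 1, 0, 0, 1, 1, 1],
    "Raining": [0, 0, 0, 0, 1, 0, 0],
})

# One named boolean mask per factor keeps each rule readable
age = binary["Age"]
female = binary["Gender"] == 0
alcohol = binary["Alcohol consumed"] == 1
fatigued = binary["Fatigued"] == 1
night = binary["Time of Day"] == 1
raining = binary["Raining"] == 1

# A row is labelled as a crash if any one of the six rules matches
crash = (
    (alcohol & fatigued & raining)
    | (age == 71)
    | ((age == 19) & alcohol)
    | (female & night & raining)
    | ((age < 26) & alcohol & fatigued)
    | ((age > 63) & alcohol & fatigued)
)

binary["Crash"] = crash.astype(int)
print(binary)
```

Named masks also avoid a subtle reading problem with conditions such as `a and b == 1`, which Python parses as `a and (b == 1)`. With 0/1 data the result is the same, but the explicit comparison states the intent.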
