# Table of Contents 
1. [Introduction](#id0)
2. [Import Packages](#id1)
3. [Data Import from FARS National Survey](#id2)
4. [Making DataFrames](#id3)
5. [Subsetting before initial assessments of column types](#id4)
6. [Merging DataFrames](#id5)
7. [Checking Column Data Types](#id6)



<a id="id0"></a>

## Introduction
Data were drawn from the Fatality Analysis Reporting System (FARS) of the National Highway Traffic Safety Administration (NHTSA). As the FARS manual notes, "Crashes each year result in thousands of lives lost, hundreds of thousands of injured victims, and billions of dollars in property damage. Accurate data are required to support the development, implementation, and assessment of highway safety programs aimed at reducing this toll." The FARS data are collected to help improve traveler safety. Their scope is specific: "To qualify as a FARS case, the crash had to involve a motor vehicle traveling on a trafficway customarily open to the public, and must have resulted in the death of a motorist or a non-motorist within 30 days of the crash."

National Center for Statistics and Analysis. (2022, March). *Fatality Analysis Reporting System analytical user’s manual, 1975-2020* (Report No. DOT HS 813 254). National Highway Traffic Safety Administration.

<a id="id1"></a>

## Import Packages

In [432]:

import pandas as pd
import csv
import os


<a id="id2"></a>

## Data Import from FARS National Survey

In [433]:
### downloaded 19 CSVs from here https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/2019/ 
data_accident = "/Users/kirk/DS/Springboard_DST/Capstone_2/data/external/FARS2019NationalCSV/accident.CSV"
data_person = "/Users/kirk/DS/Springboard_DST/Capstone_2/data/external/FARS2019NationalCSV/Person.CSV"
#data_cevent = "/Users/kirk/DS/Springboard_DST/Capstone_2/data/external/FARS2019NationalCSV/CEvent.CSV"
data_factor = "/Users/kirk/DS/Springboard_DST/Capstone_2/data/external/FARS2019NationalCSV/Factor.CSV"
data_vision = "/Users/kirk/DS/Springboard_DST/Capstone_2/data/external/FARS2019NationalCSV/Vision.csv"
data_drugs = "/Users/kirk/DS/Springboard_DST/Capstone_2/data/external/FARS2019NationalCSV/Drugs.csv"
#data_race = "/Users/kirk/DS/Springboard_DST/Capstone_2/data/external/FARS2019NationalCSV/Race.CSV"
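
The paths above share one directory prefix. As a minimal sketch of an alternative, the paths could be built from a single base directory with `pathlib`, with a check for missing files before any `read_csv` call. The names `FARS_FILES` and `build_fars_paths` and the `base_dir` argument are illustrative assumptions, not part of this notebook.

```python
from pathlib import Path

# File names as used in this notebook (note the mixed .CSV / .csv extensions).
FARS_FILES = {
    "accident": "accident.CSV",
    "person": "Person.CSV",
    "factor": "Factor.CSV",
    "vision": "Vision.csv",
    "drugs": "Drugs.csv",
}


def build_fars_paths(base_dir):
    """Return (paths, missing): a full path per table, and the tables whose file is absent."""
    base = Path(base_dir)
    paths = {name: base / fname for name, fname in FARS_FILES.items()}
    missing = sorted(name for name, path in paths.items() if not path.is_file())
    return paths, missing
```

Calling `build_fars_paths` on the download folder and inspecting `missing` would surface a misspelled file name before pandas raises a less specific error.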

<a id="id3"></a>

## Making DataFrames

In [434]:
table_accident = pd.read_csv(data_accident, low_memory=False)
table_accident.head(10)

Unnamed: 0,STATE,STATENAME,ST_CASE,VE_TOTAL,VE_FORMS,PVH_INVL,PEDS,PERSONS,PERMVIT,PERNOTMVIT,...,HOSP_MN,HOSP_MNNAME,CF1,CF1NAME,CF2,CF2NAME,CF3,CF3NAME,FATALS,DRUNK_DR
0,1,Alabama,10001,2,2,0,0,3,3,0,...,27,27,0,,0,,0,,1,1
1,1,Alabama,10002,2,2,0,0,2,2,0,...,99,Unknown EMS Hospital Arrival Time,0,,0,,0,,1,0
2,1,Alabama,10003,3,3,0,0,4,4,0,...,5,5,14,"Motor Vehicle struck by falling cargo,or somet...",0,,0,,1,0
3,1,Alabama,10004,1,1,0,1,1,1,1,...,88,Not Applicable (Not Transported),0,,0,,0,,1,0
4,1,Alabama,10005,1,1,0,0,1,1,0,...,88,Not Applicable (Not Transported),0,,0,,0,,1,1
5,1,Alabama,10006,2,2,0,0,2,2,0,...,88,Not Applicable (Not Transported),0,,0,,0,,1,0
6,1,Alabama,10007,1,1,0,0,5,5,0,...,99,Unknown EMS Hospital Arrival Time,0,,0,,0,,1,0
7,1,Alabama,10008,1,1,0,0,1,1,0,...,88,Not Applicable (Not Transported),0,,0,,0,,1,1
8,1,Alabama,10009,1,1,0,0,1,1,0,...,88,Not Applicable (Not Transported),0,,0,,0,,1,0
9,1,Alabama,10010,1,1,0,1,1,1,1,...,88,Not Applicable (Not Transported),0,,0,,0,,1,0


In [435]:
table_person = pd.read_csv(data_person, low_memory=False)
table_person.head(10)

Unnamed: 0,STATE,STATENAME,ST_CASE,VE_FORMS,VEH_NO,PER_NO,STR_VEH,COUNTY,DAY,DAYNAME,...,WORK_INJ,WORK_INJNAME,HISPANIC,HISPANICNAME,LOCATION,LOCATIONNAME,HELM_USE,HELM_USENAME,HELM_MIS,HELM_MISNAME
0,1,Alabama,10001,2,1,1,0,81,7,7,...,8,Not Applicable (not a fatality),0,Not A Fatality (not Applicable),0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
1,1,Alabama,10001,2,1,2,0,81,7,7,...,0,No,7,Non-Hispanic,0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
2,1,Alabama,10001,2,2,1,0,81,7,7,...,8,Not Applicable (not a fatality),0,Not A Fatality (not Applicable),0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
3,1,Alabama,10002,2,1,1,0,55,23,23,...,0,No,7,Non-Hispanic,0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
4,1,Alabama,10002,2,2,1,0,55,23,23,...,8,Not Applicable (not a fatality),0,Not A Fatality (not Applicable),0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
5,1,Alabama,10003,3,1,1,0,29,22,22,...,8,Not Applicable (not a fatality),0,Not A Fatality (not Applicable),0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
6,1,Alabama,10003,3,1,2,0,29,22,22,...,0,No,7,Non-Hispanic,0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
7,1,Alabama,10003,3,2,1,0,29,22,22,...,8,Not Applicable (not a fatality),0,Not A Fatality (not Applicable),0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
8,1,Alabama,10003,3,3,1,0,29,22,22,...,8,Not Applicable (not a fatality),0,Not A Fatality (not Applicable),0,Occupant of a Motor Vehicle,20,Not Applicable,7,None Used/Not Applicable
9,1,Alabama,10004,1,0,1,1,55,22,22,...,0,No,7,Non-Hispanic,11,"Not at Intersection - On Roadway, Not in Marke...",96,Not a Motor Vehicle Occupant,8,Not a Motor Vehicle Occupant


In [436]:
table_factor = pd.read_csv(data_factor, low_memory=False)
table_factor.head()

Unnamed: 0,STATE,STATENAME,ST_CASE,VEH_NO,MFACTOR,MFACTORNAME
0,1,Alabama,10001,1,0,
1,1,Alabama,10001,2,0,
2,1,Alabama,10002,1,0,
3,1,Alabama,10002,2,0,
4,1,Alabama,10003,1,0,


In [437]:
### checking out factor names in factors
pd.unique(table_factor['MFACTORNAME'])

array(['None', 'Reported as Unknown',
       'Vehicle Contributing Factors - No Details', 'Tires',
       'Other Lights', 'Other', 'Brake System', 'Head Lights',
       'Not Reported', 'Steering', 'Windows/Windshield',
       'Truck Coupling/Trailer Hitch/Safety Chains', 'Suspension',
       'Safety Systems', 'Body, Doors', 'Wheels', 'Mirrors',
       'Power Train', 'Exhaust System', 'Signal Lights', 'Wipers'],
      dtype=object)

In [438]:
table_vision = pd.read_csv(data_vision, low_memory=False)
table_vision.head()

Unnamed: 0,STATE,STATENAME,ST_CASE,VEH_NO,MVISOBSC,MVISOBSCNAME
0,1,Alabama,10001,1,0,No Obstruction Noted
1,1,Alabama,10001,2,0,No Obstruction Noted
2,1,Alabama,10002,1,0,No Obstruction Noted
3,1,Alabama,10002,2,0,No Obstruction Noted
4,1,Alabama,10003,1,0,No Obstruction Noted


In [439]:
### checking out obstruction names in vision
pd.unique(table_vision['MVISOBSCNAME'])

array(['No Obstruction Noted',
       'Curve, Hill or Other Roadway Design Feature',
       'No Driver Present/Unknown if Driver present',
       'Reflected Glare, Bright Sunlight, Headlights',
       'In-Transport Motor Vehicle (including load)',
       'Rain, Snow, Fog, Smoke, Sand, Dust', 'Reported as Unknown',
       'Vision Obscured - No Details',
       'Not In-Transport Motor Vehicle (parked, working)',
       'Other Visual Obstruction', 'Inadequate Defrost or Defog System',
       'Trees, Crops, Vegetation', 'Splash or Spray of Passing Vehicle',
       'Obstruction Interior to the Vehicle',
       'Obstructing Angles on Vehicle',
       'Inadequate Vehicle Lighting System', 'External Mirrors',
       'Broken or Improperly Cleaned Windshield',
       'Building, Billboard, Other Structure'], dtype=object)

In [440]:
table_drugs = pd.read_csv(data_drugs, low_memory=False)
table_drugs.head()

Unnamed: 0,STATE,STATENAME,ST_CASE,VEH_NO,PER_NO,DRUGSPEC,DRUGSPECNAME,DRUGRES,DRUGRESNAME
0,1,Alabama,10001,1,1,0,Test Not Given,0,Test Not Given
1,1,Alabama,10001,1,2,0,Test Not Given,0,Test Not Given
2,1,Alabama,10001,2,1,0,Test Not Given,0,Test Not Given
3,1,Alabama,10002,1,1,1,Whole Blood,401,AMPHETAMINE
4,1,Alabama,10002,1,1,1,Whole Blood,417,METHAMPHETAMINE


<a id="id4"></a>

## Subsetting before initial assessments of column types

In [441]:
### Data wrangling challenge will be to combine the best columns from the different DataFrames into one
table_accident.columns

Index(['STATE', 'STATENAME', 'ST_CASE', 'VE_TOTAL', 'VE_FORMS', 'PVH_INVL',
       'PEDS', 'PERSONS', 'PERMVIT', 'PERNOTMVIT', 'COUNTY', 'COUNTYNAME',
       'CITY', 'CITYNAME', 'DAY', 'DAYNAME', 'MONTH', 'MONTHNAME', 'YEAR',
       'DAY_WEEK', 'DAY_WEEKNAME', 'HOUR', 'HOURNAME', 'MINUTE', 'MINUTENAME',
       'NHS', 'NHSNAME', 'ROUTE', 'ROUTENAME', 'TWAY_ID', 'TWAY_ID2',
       'RUR_URB', 'RUR_URBNAME', 'FUNC_SYS', 'FUNC_SYSNAME', 'RD_OWNER',
       'RD_OWNERNAME', 'MILEPT', 'MILEPTNAME', 'LATITUDE', 'LATITUDENAME',
       'LONGITUD', 'LONGITUDNAME', 'SP_JUR', 'SP_JURNAME', 'HARM_EV',
       'HARM_EVNAME', 'MAN_COLL', 'MAN_COLLNAME', 'RELJCT1', 'RELJCT1NAME',
       'RELJCT2', 'RELJCT2NAME', 'TYP_INT', 'TYP_INTNAME', 'WRK_ZONE',
       'WRK_ZONENAME', 'REL_ROAD', 'REL_ROADNAME', 'LGT_COND', 'LGT_CONDNAME',
       'WEATHER1', 'WEATHER1NAME', 'WEATHER2', 'WEATHER2NAME', 'WEATHER',
       'WEATHERNAME', 'SCH_BUS', 'SCH_BUSNAME', 'RAIL', 'RAILNAME', 'NOT_HOUR',
       'NOT_HOURNAME', 'NOT

In [442]:
### Really large column list for person data
print(table_person.columns.tolist())

['STATE', 'STATENAME', 'ST_CASE', 'VE_FORMS', 'VEH_NO', 'PER_NO', 'STR_VEH', 'COUNTY', 'DAY', 'DAYNAME', 'MONTH', 'MONTHNAME', 'HOUR', 'HOURNAME', 'MINUTE', 'MINUTENAME', 'RUR_URB', 'RUR_URBNAME', 'FUNC_SYS', 'FUNC_SYSNAME', 'HARM_EV', 'HARM_EVNAME', 'MAN_COLL', 'MAN_COLLNAME', 'SCH_BUS', 'SCH_BUSNAME', 'MAKE', 'MAKENAME', 'MAK_MOD', 'BODY_TYP', 'BODY_TYPNAME', 'MOD_YEAR', 'MOD_YEARNAME', 'TOW_VEH', 'TOW_VEHNAME', 'SPEC_USE', 'SPEC_USENAME', 'EMER_USE', 'EMER_USENAME', 'ROLLOVER', 'ROLLOVERNAME', 'IMPACT1', 'IMPACT1NAME', 'FIRE_EXP', 'FIRE_EXPNAME', 'AGE', 'AGENAME', 'SEX', 'SEXNAME', 'PER_TYP', 'PER_TYPNAME', 'INJ_SEV', 'INJ_SEVNAME', 'SEAT_POS', 'SEAT_POSNAME', 'REST_USE', 'REST_USENAME', 'REST_MIS', 'REST_MISNAME', 'AIR_BAG', 'AIR_BAGNAME', 'EJECTION', 'EJECTIONNAME', 'EJ_PATH', 'EJ_PATHNAME', 'EXTRICAT', 'EXTRICATNAME', 'DRINKING', 'DRINKINGNAME', 'ALC_DET', 'ALC_DETNAME', 'ALC_STATUS', 'ALC_STATUSNAME', 'ATST_TYP', 'ATST_TYPNAME', 'ALC_RES', 'ALC_RESNAME', 'DRUGS', 'DRUGSNAME', 'D

In [443]:
#### Need to subset for important columns only in accident, winnow the other tables, and then combine
#### .copy() makes each subset an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df_accident = table_accident[['STATENAME', 'ST_CASE', 'PEDS', 'RUR_URBNAME', 'HARM_EVNAME',
                              'TYP_INTNAME', 'LGT_CONDNAME', 'WEATHER1NAME', 'WEATHER2NAME',
                              'WEATHERNAME', 'FATALS', 'DRUNK_DR']].copy()

df_person = table_person[['STATE', 'ST_CASE', 'VE_FORMS', 'VEH_NO', 'PER_NO', 'AGE', 'SEXNAME', 'PER_TYPNAME', 'DRINKINGNAME']].copy()

df_factor = table_factor[['ST_CASE', 'VEH_NO', 'MFACTORNAME']].copy()

df_vision = table_vision[['ST_CASE', 'VEH_NO', 'MVISOBSCNAME']].copy()

df_drugs = table_drugs[['ST_CASE', 'VEH_NO', 'PER_NO', 'DRUGRESNAME']].copy()
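
Since only a handful of columns are kept from each table, another option is to load just those columns in the first place. The sketch below uses the `usecols` parameter of `pd.read_csv` on a tiny in-memory stand-in for `Factor.CSV`; the sample rows and the name `df_factor_small` are made up for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for Factor.CSV with the same column names as the real file.
csv_text = """STATE,STATENAME,ST_CASE,VEH_NO,MFACTOR,MFACTORNAME
1,Alabama,10001,1,1,Tires
1,Alabama,10001,2,97,Other
"""

wanted = ["ST_CASE", "VEH_NO", "MFACTORNAME"]

# usecols restricts parsing to the listed columns; the others are never loaded.
df_factor_small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
```

This trades flexibility for memory: the wide accident and person tables shrink considerably, but the full column list is no longer available for exploration.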


In [444]:
df_accident.columns

Index(['STATENAME', 'ST_CASE', 'PEDS', 'RUR_URBNAME', 'HARM_EVNAME',
       'TYP_INTNAME', 'LGT_CONDNAME', 'WEATHER1NAME', 'WEATHER2NAME',
       'WEATHERNAME', 'FATALS', 'DRUNK_DR'],
      dtype='object')

In [445]:
df_person.columns

Index(['STATE', 'ST_CASE', 'VE_FORMS', 'VEH_NO', 'PER_NO', 'AGE', 'SEXNAME',
       'PER_TYPNAME', 'DRINKINGNAME'],
      dtype='object')

In [446]:
df_factor.columns

Index(['ST_CASE', 'VEH_NO', 'MFACTORNAME'], dtype='object')

In [447]:
df_vision.columns

Index(['ST_CASE', 'VEH_NO', 'MVISOBSCNAME'], dtype='object')

In [448]:
df_drugs.columns

Index(['ST_CASE', 'VEH_NO', 'PER_NO', 'DRUGRESNAME'], dtype='object')

In [449]:
### But drug names data are too complex for these purposes; let's hold off on simplifying to drugs or no_drugs
df_drugs['DRUGRESNAME'].value_counts()


Test Not Given                     48439
Tested, No Drugs Found/Negative    14313
Not Reported                        6414
Other Drug                          6209
Tetrahydrocannabinols (THC)         3296
                                   ...  
Cloxazolam                             1
Hydroxypethidine                       1
Methylone                              1
ETHYLMORPHINE                          1
Methenolone                            1
Name: DRUGRESNAME, Length: 133, dtype: int64

In [461]:
#pd.unique(df_drugs['DRUGRESNAME'])


In [None]:

#df_drugs.loc[:,'DRUGRESNAME'] = df_drugs['DRUGRESNAME'].astype('category')
#df_drugs['DRUGRESNAME'] = df_drugs['DRUGRESNAME'].cat.rename_categories({'Test Not Given':'no_drugs'})
                                                                         
                                                                         

In [None]:
#df_drugs.loc[:,'DRUGRESNAME'] = df_drugs['DRUGRESNAME'].cat.rename_categories({'Tested, No Drugs Found/Negative':'no_drugs','Test For Drug, Results Unknown':'no_drugs', 'Tested, No Drugs Found/Negative':'no_drugs','Test For Drug, Results Unknown':'no_drugs', 'DELTA 9':'drugs', 'Tetrahydrocannabinols (THC)': 'drugs', 'Nordiazepam': 'drugs', 'HYDROCODONE': 'drugs', 'CHLORDIAZEPOXIDE': 'drugs', 'ALPRAZOLAM': 'drugs', 'MORPHINE': 'drugs', 'Clonazepam': 'drugs','Lormetazepam': 'drugs', 'DIAZEPAM': 'drugs', 'OXYCODONE': 'drugs', 'BENZODIAZEPINES': 'drugs','OXAZEPAM': 'drugs', 'TEMAZEPAM': 'drugs', 'Buprenorphine': 'drugs', 'Other Drug': 'drugs', 'Zolpidem': 'drugs','COCAINE': 'drugs', 'BENZOYLECGONINE': 'drugs', 'PHENOBARBITAL': 'drugs', '"Cannabinoid': 'drugs', 'HASHISH OIL': 'drugs', 'Midazolam': 'drugs', 'METHADONE': 'drugs', 'Butalbital': 'drugs', 'FENTANYL': 'drugs', 'CODEINE': 'drugs', 'PSILOCYN': 'drugs','Tested For Drugs':'drugs', 'Lysergic Acid Diethylamide (LSD)': 'drugs', 'Ketamine': 'drugs', 'LORAZEPAM': 'drugs', 'Reported as Unknown if Tested for Drugs' : 'no_drugs', 'Not Reported': 'no_drugs','MARIJUANA/Marihuana': 'drugs', 'AMOBARBITAL': 'drugs', 'PENTOBARBITAL': 'drugs','PHENTERMINE': 'drugs', 'PHENCYCLIDINE': 'drugs', 'Oxymorphone': 'drugs', 'HYDROMORPHONE': 'drugs','MEPROBAMATE': 'drugs', 'Fenproporex': 'drugs', 'Acetaminophen + Codeine': 'drugs','Loprazolam': 'drugs', 'Carisoprodol': 'drugs', 'OPIUM': 'drugs', '"Narcotics: drugs': 'drugs','Cathine (Norpseudoephedrine)': 'drugs', 'Levorphanol': 'drugs', 'Ecgonine': 'drugs','Methylenedioxymethamphetamine (MDMA)': 'drugs', 'DIHYDROCODEINE': 'drugs','Heroin (Diacetylmorphine)': 'drugs', 'Parahexyl (Synhexyl)': 'drugs', 'Hydroxyzine': 'drugs','Carfentanil': 'drugs', 'Gamma Hydroxybutryate (GHB)': 'drugs', 'BARBITURATES': 'drugs','Myrophine': 'drugs', 'Methoxy-Methylenedioxyamphetamine': 'drugs','Methylenedioxyamphetamine (MDA)': 'drugs', '"Inhalants': 'drugs','Alpha-meprodine': 'drugs', '"PCP':'drugs', 
#'METHYLPHENIDATE': 'drugs','Levormethorphan': 'drugs', 'Chlorphentermine': 'drugs', 'Alpha':'drugs', 'Levomoramide': 'drugs', 'Acetyl-alpha-methylfentanyl': 'drugs', 'Methoxy-NN-diisopropyltryptamine': 'drugs','BUTABARBITAL': 'drugs', 'Hydroxy-Nortestosterone': 'drugs', 'AMPHETAMINE SULFATE': 'drugs','Pregabalin': 'drugs', '"Depressants':'drugs','Stimulants':'drugs', 'Oxazolam': 'drugs', 'Clotiazepam': 'drugs','Zopiclone': 'drugs', 'AMPHETAMINE VARIANTS': 'drugs', 'Delorazepam': 'drugs','Opium extract': 'drugs', 'Phencyclohexylamine': 'drugs', 'Dimethylamphetamine': 'drugs','Aerosols (hydrocarbon)': 'drugs', 'Diprenorphine': 'drugs', 'Phenampromide': 'drugs','Norlevorphanol': 'drugs', 'Dextromoramide': 'drugs', 'TRAIZOLAM': 'drugs', 'Alfentanil': 'drugs','Coca Leaves': 'drugs', 'Estazolam': 'drugs', 'MEPERIDINE (Pethidine)': 'drugs','Dimethyltryptamine(DMT)': 'drugs', 'Cloxazolam': 'drugs', 'Hydroxypethidine': 'drugs','Methylone': 'drugs', 'ETHYLMORPHINE': 'drugs', 'Benzylmorphine': 'drugs', 'Oxandrolone': 'drugs', 'phenylcyclohexylamine': 'drugs', '"Hallucinogens': 'drugs', 'Thenylfentanyl': 'drugs', 'BUFOTENINE': 'drugs', 'Metazocine': 'drugs','Codeine combination product 90 mg/du': 'drugs', 'Morphine combination product/50 mg/100 ml or gm': 'drugs', 'SECOBARBITAL': 'drugs', 'Flunitrazepam': 'drugs', 'Dextropropoxyphene (dosage forms)': 'drugs', 'Stimulant compounds previously excepted': 'drugs', 'Methyl-dimethoxyamphetamine': 'drugs', 'Phenadoxone': 'drugs', 'Anesthetic Gases': 'drugs', 'PENTAZOCINE': 'drugs', 'PSILOCYBIN': 'drugs', 'Piminodine': 'drugs', 'Fluoxymesterone': 'drugs', 'Mecloqualone': 'drugs', 'Butobarbital (butethal)': 'drugs', 'Morpheridine': 'drugs', 'Metopon': 'drugs','Dihydrocodeine combination product 90 mg/du': 'drugs','Zaleplon': 'drugs', 'Methenolone': 'drugs'})
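
Listing every drug name by hand is brittle, because any name missing from the dictionary keeps its original label. A shorter sketch of the same drugs / no_drugs idea is to name only the "no drug" labels and treat every other non-missing value as a positive. The labels in `NO_DRUG_LABELS` are the ones the commented-out mapping above sends to `no_drugs`; the names `NO_DRUG_LABELS` and `to_drug_flag` are hypothetical. Note that this keeps that mapping's choice of counting untested and unreported cases as `no_drugs`, which may understate drug involvement.

```python
import pandas as pd

# Labels that the commented-out mapping above sends to 'no_drugs'.
NO_DRUG_LABELS = {
    "Test Not Given",
    "Tested, No Drugs Found/Negative",
    "Test For Drug, Results Unknown",
    "Reported as Unknown if Tested for Drugs",
    "Not Reported",
}


def to_drug_flag(names):
    """Map each DRUGRESNAME value to 'no_drugs' or 'drugs'; missing values stay missing."""
    # Keep missing values as they are, label everything else 'drugs' ...
    flags = names.where(names.isna(), "drugs")
    # ... then overwrite the known negative / untested labels.
    return flags.mask(names.isin(NO_DRUG_LABELS), "no_drugs")
```

Applied to `df_drugs['DRUGRESNAME']`, this would give a two-level column without needing the 133 distinct names spelled out.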


<a id="id5"></a>

## Merging DataFrames

In [None]:
### Person is not merged with drugs for now; when it is, the keys will be state case, vehicle no, and person no.
### Factor will need to be indexed with state case and veh_no; Vision also.


In [None]:
### Person would be merged with drugs by state case, vehicle no, and person no; could be melted to several columns?
#df_merged_person = df_person.merge(df_drugs, on=['ST_CASE','VEH_NO', 'PER_NO'], how = 'left')
#df_merged_person.head()
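
One complication with the person/drugs merge is that the drugs table can hold several rows per person (case 10002, vehicle 1, person 1 has two results in the preview above), so a plain left merge would duplicate person rows. A minimal sketch, on toy frames shaped like `df_person` and `df_drugs`, of collapsing drug results to one row per person before merging. The frame names `person`, `drugs`, and `drugs_per_person` are illustrative only.

```python
import pandas as pd

# Toy rows shaped like df_drugs: person (10002, 1, 1) has two positive results.
drugs = pd.DataFrame({
    "ST_CASE": [10001, 10002, 10002],
    "VEH_NO": [1, 1, 1],
    "PER_NO": [1, 1, 1],
    "DRUGRESNAME": ["Test Not Given", "AMPHETAMINE", "METHAMPHETAMINE"],
})
person = pd.DataFrame({
    "ST_CASE": [10001, 10002],
    "VEH_NO": [1, 1],
    "PER_NO": [1, 1],
    "AGE": [34, 42],
})

keys = ["ST_CASE", "VEH_NO", "PER_NO"]

# One row per person: join that person's results into a single string.
drugs_per_person = drugs.groupby(keys, as_index=False)["DRUGRESNAME"].agg("; ".join)

# validate raises a MergeError if either side still has duplicate keys.
merged = person.merge(drugs_per_person, on=keys, how="left", validate="one_to_one")
```

Joining into one string keeps the row count stable; pivoting results into separate columns would be an alternative if each drug needs its own field.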

In [452]:
### Adding factor data
df_merged_accident = df_accident.merge(df_factor, on=['ST_CASE'], how = 'left')
df_merged_accident.head(10)

Unnamed: 0,STATENAME,ST_CASE,PEDS,RUR_URBNAME,HARM_EVNAME,TYP_INTNAME,LGT_CONDNAME,WEATHER1NAME,WEATHER2NAME,WEATHERNAME,FATALS,DRUNK_DR,VEH_NO,MFACTORNAME
0,Alabama,10001,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Daylight,Clear,No Additional Atmospheric Conditions,Clear,1,1,1,
1,Alabama,10001,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Daylight,Clear,No Additional Atmospheric Conditions,Clear,1,1,2,
2,Alabama,10002,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Rain,No Additional Atmospheric Conditions,Rain,1,0,1,
3,Alabama,10002,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Rain,No Additional Atmospheric Conditions,Rain,1,0,2,
4,Alabama,10003,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Cloudy,No Additional Atmospheric Conditions,Cloudy,1,0,1,
5,Alabama,10003,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Cloudy,No Additional Atmospheric Conditions,Cloudy,1,0,2,
6,Alabama,10003,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Cloudy,No Additional Atmospheric Conditions,Cloudy,1,0,3,
7,Alabama,10004,1,Rural,Pedestrian,Not an Intersection,Dark - Not Lighted,Clear,No Additional Atmospheric Conditions,Clear,1,0,1,
8,Alabama,10005,0,Urban,Rollover/Overturn,Not an Intersection,Dark - Not Lighted,"Fog, Smog, Smoke",No Additional Atmospheric Conditions,"Fog, Smog, Smoke",1,1,1,
9,Alabama,10006,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Daylight,Clear,No Additional Atmospheric Conditions,Clear,1,0,1,
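
The merge above turns one row per crash into one row per vehicle, which is why cases 10001 to 10003 repeat in the preview. A small sketch, on toy frames, of making that expectation explicit with the `validate` parameter of `DataFrame.merge`; the frame names and sample values are made up for illustration.

```python
import pandas as pd

accident = pd.DataFrame({"ST_CASE": [10001, 10002], "FATALS": [1, 1]})
factor = pd.DataFrame({
    "ST_CASE": [10001, 10001, 10002, 10002],
    "VEH_NO": [1, 2, 1, 2],
    "MFACTORNAME": ["Tires", "Other", "Tires", "Other"],
})

# validate="one_to_many" raises pandas.errors.MergeError if ST_CASE
# is duplicated on the left (accident) side.
merged = accident.merge(factor, on="ST_CASE", how="left", validate="one_to_many")

rows_before = len(accident)
rows_after = len(merged)
```

Comparing `rows_before` and `rows_after` after each merge is a cheap way to notice unexpected row growth.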


In [453]:
### reordering columns to get VEH_NO next to ST_CASE 
cols = ['STATENAME', 'ST_CASE', 'VEH_NO', 'FATALS', 'DRUNK_DR', 'PEDS', 'RUR_URBNAME', 'HARM_EVNAME',
       'TYP_INTNAME', 'LGT_CONDNAME', 'WEATHER1NAME', 'WEATHER2NAME',
       'WEATHERNAME',  'MFACTORNAME']
df_merged_accident = df_merged_accident[cols]

In [454]:
df_merged_accident.columns

Index(['STATENAME', 'ST_CASE', 'VEH_NO', 'FATALS', 'DRUNK_DR', 'PEDS',
       'RUR_URBNAME', 'HARM_EVNAME', 'TYP_INTNAME', 'LGT_CONDNAME',
       'WEATHER1NAME', 'WEATHER2NAME', 'WEATHERNAME', 'MFACTORNAME'],
      dtype='object')

In [456]:
### Adding vision
df_merged_accident = df_merged_accident.merge(df_vision, on=['ST_CASE', 'VEH_NO',], how = 'left')
df_merged_accident.head(10)


Unnamed: 0,STATENAME,ST_CASE,VEH_NO,FATALS,DRUNK_DR,PEDS,RUR_URBNAME,HARM_EVNAME,TYP_INTNAME,LGT_CONDNAME,WEATHER1NAME,WEATHER2NAME,WEATHERNAME,MFACTORNAME,MVISOBSCNAME
0,Alabama,10001,1,1,1,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Daylight,Clear,No Additional Atmospheric Conditions,Clear,,No Obstruction Noted
1,Alabama,10001,2,1,1,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Daylight,Clear,No Additional Atmospheric Conditions,Clear,,No Obstruction Noted
2,Alabama,10002,1,1,0,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Rain,No Additional Atmospheric Conditions,Rain,,No Obstruction Noted
3,Alabama,10002,2,1,0,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Rain,No Additional Atmospheric Conditions,Rain,,No Obstruction Noted
4,Alabama,10003,1,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Cloudy,No Additional Atmospheric Conditions,Cloudy,,No Obstruction Noted
5,Alabama,10003,2,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Cloudy,No Additional Atmospheric Conditions,Cloudy,,No Obstruction Noted
6,Alabama,10003,3,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,Cloudy,No Additional Atmospheric Conditions,Cloudy,,No Obstruction Noted
7,Alabama,10004,1,1,0,1,Rural,Pedestrian,Not an Intersection,Dark - Not Lighted,Clear,No Additional Atmospheric Conditions,Clear,,No Obstruction Noted
8,Alabama,10005,1,1,1,0,Urban,Rollover/Overturn,Not an Intersection,Dark - Not Lighted,"Fog, Smog, Smoke",No Additional Atmospheric Conditions,"Fog, Smog, Smoke",,No Obstruction Noted
9,Alabama,10006,1,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Daylight,Clear,No Additional Atmospheric Conditions,Clear,,No Obstruction Noted


In [457]:
### Adding df_person to df_merged_accident
df_merged_accident = df_merged_accident.merge(df_person, on=['ST_CASE', 'VEH_NO'], how = 'left')
df_merged_accident.head(10)


Unnamed: 0,STATENAME,ST_CASE,VEH_NO,FATALS,DRUNK_DR,PEDS,RUR_URBNAME,HARM_EVNAME,TYP_INTNAME,LGT_CONDNAME,...,WEATHERNAME,MFACTORNAME,MVISOBSCNAME,STATE,VE_FORMS,PER_NO,AGE,SEXNAME,PER_TYPNAME,DRINKINGNAME
0,Alabama,10001,1,1,1,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Daylight,...,Clear,,No Obstruction Noted,1.0,2.0,1.0,34.0,Female,Driver of a Motor Vehicle In-Transport,No (Alcohol Not Involved)
1,Alabama,10001,1,1,1,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Daylight,...,Clear,,No Obstruction Noted,1.0,2.0,2.0,53.0,Male,Passenger of a Motor Vehicle In-Transport,Not Reported
2,Alabama,10001,2,1,1,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Daylight,...,Clear,,No Obstruction Noted,1.0,2.0,1.0,59.0,Male,Driver of a Motor Vehicle In-Transport,Yes (Alcohol Involved)
3,Alabama,10002,1,1,0,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,...,Rain,,No Obstruction Noted,1.0,2.0,1.0,42.0,Female,Driver of a Motor Vehicle In-Transport,Reported as Unknown
4,Alabama,10002,2,1,0,0,Urban,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,...,Rain,,No Obstruction Noted,1.0,2.0,1.0,54.0,Female,Driver of a Motor Vehicle In-Transport,No (Alcohol Not Involved)
5,Alabama,10003,1,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,...,Cloudy,,No Obstruction Noted,1.0,3.0,1.0,22.0,Male,Driver of a Motor Vehicle In-Transport,Reported as Unknown
6,Alabama,10003,1,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,...,Cloudy,,No Obstruction Noted,1.0,3.0,2.0,19.0,Male,Passenger of a Motor Vehicle In-Transport,Not Reported
7,Alabama,10003,2,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,...,Cloudy,,No Obstruction Noted,1.0,3.0,1.0,22.0,Male,Driver of a Motor Vehicle In-Transport,No (Alcohol Not Involved)
8,Alabama,10003,3,1,0,0,Rural,Motor Vehicle In-Transport,Not an Intersection,Dark - Not Lighted,...,Cloudy,,No Obstruction Noted,1.0,3.0,1.0,26.0,Female,Driver of a Motor Vehicle In-Transport,No (Alcohol Not Involved)
9,Alabama,10004,1,1,0,1,Rural,Pedestrian,Not an Intersection,Dark - Not Lighted,...,Clear,,No Obstruction Noted,1.0,1.0,1.0,41.0,Male,Driver of a Motor Vehicle In-Transport,No (Alcohol Not Involved)
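
In the person preview earlier, the non-occupant in case 10004 has `VEH_NO` 0, while the factor and vision tables list vehicles only. A left merge that starts from the vehicle-level frame therefore appears to leave such persons out (case 10004 shows only the driver above). The sketch below, on toy frames with made-up names, uses `indicator=True` to list person rows that find no vehicle-level match.

```python
import pandas as pd

# Toy frames: case 10004 has one vehicle and one pedestrian (VEH_NO 0).
vehicles = pd.DataFrame({"ST_CASE": [10004], "VEH_NO": [1]})
persons = pd.DataFrame({
    "ST_CASE": [10004, 10004],
    "VEH_NO": [1, 0],
    "PER_NO": [1, 1],
    "PER_TYPNAME": ["Driver of a Motor Vehicle In-Transport", "Pedestrian"],
})

# Start from persons and flag which rows find a vehicle-level match.
check = persons.merge(vehicles, on=["ST_CASE", "VEH_NO"], how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]
```

Whether dropping non-motorists is acceptable depends on the analysis; if pedestrians matter, merging persons on `ST_CASE` alone or keeping them in a separate frame may be worth considering.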


<a id="id6"></a>

## Checking Column Data Types

In [458]:
### changing some int64 and object columns into categories
df_merged_accident['STATE'] = df_merged_accident['STATE'].astype('category')
df_merged_accident['ST_CASE'] = df_merged_accident['ST_CASE'].astype('category')
df_merged_accident['STATENAME'] = df_merged_accident['STATENAME'].astype('category')
df_merged_accident['DRUNK_DR'] = df_merged_accident['DRUNK_DR'].astype('category')
df_merged_accident['RUR_URBNAME'] = df_merged_accident['RUR_URBNAME'].astype('category')
df_merged_accident.dtypes

STATENAME       category
ST_CASE         category
VEH_NO             int64
FATALS             int64
DRUNK_DR        category
PEDS               int64
RUR_URBNAME     category
HARM_EVNAME       object
TYP_INTNAME       object
LGT_CONDNAME      object
WEATHER1NAME      object
WEATHER2NAME      object
WEATHERNAME       object
MFACTORNAME       object
MVISOBSCNAME      object
STATE           category
VE_FORMS         float64
PER_NO           float64
AGE              float64
SEXNAME           object
PER_TYPNAME       object
DRINKINGNAME      object
dtype: object
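
The `float64` columns in the listing above (`VE_FORMS`, `PER_NO`, `AGE`) presumably come from missing values introduced by the left merge, since NumPy integers cannot hold NaN. A minimal sketch, on a toy frame, of converting several columns in one `astype` call and using the nullable `Int64` dtype for whole numbers with gaps. The toy values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "STATENAME": ["Alabama", "Alabama"],
    "DRUNK_DR": [1, 0],
    "AGE": [34.0, None],  # float64 because of the missing value
})

# A dict passed to astype converts each listed column to its target dtype.
df = df.astype({
    "STATENAME": "category",
    "DRUNK_DR": "category",
    "AGE": "Int64",  # nullable integer: keeps whole numbers and <NA>
})
```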

In [459]:
# save the data to a new csv file
df_merged_accident.to_csv('../data/processed/accident_merged_data.csv')
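
CSV does not store dtypes, so the category conversions above are lost when the file is read back, and writing the index adds an unnamed first column. A sketch of a round trip on a toy frame that passes `index=False` when writing and restores the category dtype on read; it uses an in-memory buffer instead of the real output path.

```python
import io

import pandas as pd

df = pd.DataFrame({"ST_CASE": [10001, 10002], "STATENAME": ["Alabama", "Alabama"]})
df["STATENAME"] = df["STATENAME"].astype("category")

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column
buffer.seek(0)

# dtype restores the category type that the CSV text cannot carry.
restored = pd.read_csv(buffer, dtype={"STATENAME": "category"})
```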


In [460]:
### Total fatalities
table_accident['FATALS'].sum()

36355
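
The total above is taken from `table_accident`, which has one row per crash. The same sum on the merged frame would overcount, because `FATALS` is repeated on every vehicle and person row of a case. A toy illustration, with made-up values, of the difference and of de-duplicating on `ST_CASE` first:

```python
import pandas as pd

# Toy merged frame: case 10001 appears once per person, FATALS repeated on each row.
merged = pd.DataFrame({
    "ST_CASE": [10001, 10001, 10001, 10002],
    "FATALS": [1, 1, 1, 2],
})

# Summing every row counts the repeated values.
naive_total = merged["FATALS"].sum()

# Keeping one row per crash before summing gives the per-case total.
per_case_total = merged.drop_duplicates("ST_CASE")["FATALS"].sum()
```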