# ELT

## Objectives

* Load the raw data, inspect it, and remove missing data and duplicated rows. Check data types for any issues.
* Prepare the data for subsampling

## Inputs

* The dataset, "US_Accidents_March23_sampled_500k.csv", which is saved locally in "Data/Raw/US_Accidents"

## Outputs

* Cleaned CSV file "US_Accidents_For_Subsampling.csv", saved in "Data/Cleaned"

## Steps Carried Out

* Loaded the data
* Checked data types
* Dropped unnecessary columns
* Dropped rows with missing values
* Dropped rows with the "N/A Precipitation" placeholder
* Checked that "ID" values are unique before dropping the column
* Dropped duplicated rows
* Saved to a CSV file

---

## Change Working Directory

* The notebooks are assumed to be stored in a subfolder of the project, so the working directory must be changed when a notebook is run in the editor

We need to change the working directory from the notebook's folder to its parent folder
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project'
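
Running cell In [2] a second time would move the working directory up another level. As a minimal sketch of a guard against that, assuming the notebooks folder is named "jupyter_notebooks" as in the output above (the helper name move_to_project_root is hypothetical and not part of the notebook):

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory only when run from the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Still inside the notebooks subfolder, so step up one level.
        os.chdir(cwd.parent)
    # Otherwise the directory was already changed (or the layout differs): do nothing.
    return Path.cwd()


project_root = move_to_project_root()
print(project_root)
```

Because the function checks the folder name first, re-running the cell leaves the working directory unchanged.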

---

## Required Libraries

In [4]:
import pandas as pd
import numpy as np

---

## Load Raw Dataset

I will use Pandas to load the data into a DataFrame (df).

In [5]:
df = pd.read_csv("Data/Raw/US_Accidents/US_Accidents_March23_sampled_500k.csv")
pd.set_option("display.max_columns", None)
df

Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,Street,City,County,State,Zipcode,Country,Timezone,Airport_Code,Weather_Timestamp,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-2047758,Source2,2,2019-06-12 10:10:56,2019-06-12 10:55:58,30.641211,-91.153481,,,0.000,Accident on LA-19 Baker-Zachary Hwy at Lower Z...,Highway 19,Zachary,East Baton Rouge,LA,70791-4610,US,US/Central,KBTR,2019-06-12 09:53:00,77.0,77.0,62.0,29.92,10.0,NW,5.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day,Day,Day,Day
1,A-4694324,Source1,2,2022-12-03 23:37:14.000000000,2022-12-04 01:56:53.000000000,38.990562,-77.399070,38.990037,-77.398282,0.056,Incident on FOREST RIDGE DR near PEPPERIDGE PL...,Forest Ridge Dr,Sterling,Loudoun,VA,20164-2813,US,US/Eastern,KIAD,2022-12-03 23:52:00,45.0,43.0,48.0,29.91,10.0,W,5.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Night
2,A-5006183,Source1,2,2022-08-20 13:13:00.000000000,2022-08-20 15:22:45.000000000,34.661189,-120.492822,34.661189,-120.492442,0.022,Accident on W Central Ave from Floradale Ave t...,Floradale Ave,Lompoc,Santa Barbara,CA,93436,US,US/Pacific,KLPC,2022-08-20 12:56:00,68.0,68.0,73.0,29.79,10.0,W,13.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day,Day,Day,Day
3,A-4237356,Source1,2,2022-02-21 17:43:04,2022-02-21 19:43:23,43.680592,-92.993317,43.680574,-92.972223,1.054,Incident on I-90 EB near REST AREA Drive with ...,14th St NW,Austin,Mower,MN,55912,US,US/Central,KAUM,2022-02-21 17:35:00,27.0,15.0,86.0,28.49,10.0,ENE,15.0,0.00,Wintry Mix,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day
4,A-6690583,Source1,2,2020-12-04 01:46:00,2020-12-04 04:13:09,35.395484,-118.985176,35.395476,-118.985995,0.046,RP ADV THEY LOCATED SUSP VEH OF 20002 - 726 CR...,River Blvd,Bakersfield,Kern,CA,93305-2649,US,US/Pacific,KBFL,2020-12-04 01:54:00,42.0,42.0,34.0,29.77,10.0,CALM,0.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Night
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,A-6077227,Source1,2,2021-12-15 07:30:00,2021-12-15 07:50:30,45.522510,-123.084104,45.520225,-123.084211,0.158,Stationary traffic on OR-47 from NW Martin Rd ...,Quince St,Forest Grove,Washington,OR,97116-2174,US,US/Pacific,KHIO,2021-12-15 07:14:00,40.0,32.0,77.0,29.55,10.0,SSE,15.0,0.01,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Day,Day,Day
499996,A-6323243,Source1,2,2021-12-19 16:25:00,2021-12-19 17:40:37,26.702570,-80.111169,26.703141,-80.111133,0.040,Incident on MILITARY TRL near WESTGATE AVE Dri...,N Military Trl,West Palm Beach,Palm Beach,FL,33409-4712,US,US/Eastern,KPBI,2021-12-19 16:53:00,78.0,78.0,87.0,29.94,10.0,SSE,13.0,0.01,Partly Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day
499997,A-3789256,Source1,2,2022-04-13 19:28:29,2022-04-13 21:33:44,34.561862,-112.259620,34.566822,-112.267150,0.549,Crash on the right shoulder on E SR-69 Northbo...,E AZ-69,Dewey,Yavapai,AZ,86327,US,US/Mountain,KPRC,2022-04-13 19:53:00,52.0,52.0,12.0,24.94,10.0,WSW,12.0,0.00,Fair,False,False,True,True,False,False,False,False,False,False,False,True,False,Night,Night,Day,Day
499998,A-7030381,Source1,3,2020-05-15 17:20:56,2020-05-15 17:50:56,38.406680,-78.619310,38.406680,-78.619310,0.000,At US-340/S Stuart Ave - Serious accident.,W Spotswood Trl,Elkton,Rockingham,VA,22827,US,US/Eastern,KSHD,2020-05-15 17:15:00,82.0,82.0,38.0,28.70,10.0,SSW,14.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day,Day,Day,Day


I can see this is a large dataset with 500,000 rows and 46 columns.

---

## Initial Inspection and Cleaning

I will look at what data types are present and whether there are missing values.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 46 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   ID                     500000 non-null  object 
 1   Source                 500000 non-null  object 
 2   Severity               500000 non-null  int64  
 3   Start_Time             500000 non-null  object 
 4   End_Time               500000 non-null  object 
 5   Start_Lat              500000 non-null  float64
 6   Start_Lng              500000 non-null  float64
 7   End_Lat                279623 non-null  float64
 8   End_Lng                279623 non-null  float64
 9   Distance(mi)           500000 non-null  float64
 10  Description            499999 non-null  object 
 11  Street                 499309 non-null  object 
 12  City                   499981 non-null  object 
 13  County                 500000 non-null  object 
 14  State                  500000 non-nu

I can see there is a mixture of objects, booleans, floats and integers. At this point, all of the data types appear consistent with expectations.
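
"Start_Time" and "End_Time" are stored as object (string) columns, and the sample rows above show two layouts, with and without fractional seconds. This notebook leaves them as they are. Should a later step need time-based features, one possible conversion is sketched below on two rows copied from the output above. Truncating to 19 characters is an assumption that the fractional part can be discarded, and "Duration_min" is a hypothetical column name.

```python
import pandas as pd

# Two rows copied from the DataFrame preview, one per timestamp layout.
sample = pd.DataFrame({
    "Start_Time": ["2019-06-12 10:10:56", "2022-12-03 23:37:14.000000000"],
    "End_Time": ["2019-06-12 10:55:58", "2022-12-04 01:56:53.000000000"],
})

for col in ["Start_Time", "End_Time"]:
    # Keep "YYYY-MM-DD HH:MM:SS" (19 characters) so both layouts share one format.
    sample[col] = pd.to_datetime(
        sample[col].str.slice(0, 19), format="%Y-%m-%d %H:%M:%S"
    )

# Hypothetical derived feature: accident duration in minutes.
sample["Duration_min"] = (
    sample["End_Time"] - sample["Start_Time"]
).dt.total_seconds() / 60

print(sample.dtypes)
```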

I can also see that there is missing data in many of the variables. My plan is to first drop variables that I'm sure I won't need, then delete rows with missing values, and finally look for duplicates. I have chosen to delete rows rather than impute because the dataset is far larger than needed, and I will subsample it to a smaller dataset of ~10,000 instances. In this circumstance, I believe it makes sense to keep only complete rows and drop incomplete ones.

In [7]:
df_clean = df.drop(columns=[
    "Source", "End_Lat", "End_Lng", "Description", "Street", "Zipcode", "Country",
    "Weather_Timestamp", "Civil_Twilight", "Nautical_Twilight", "Astronomical_Twilight"
])

I choose to drop these variables because I believe they will either be irrelevant or potentially lead to overfitting of the model.
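
An alternative, shown here only as a sketch, is to skip unwanted columns while reading the file, which avoids holding them in memory at all. The example uses a tiny in-memory CSV in place of the real file, and the names UNUSED and df_small are hypothetical.

```python
import io

import pandas as pd

# Tiny stand-in for the raw CSV file (made-up rows for illustration).
csv_text = (
    "ID,Source,Severity,Description\n"
    "A-1,Source1,2,Example text\n"
    "A-2,Source2,3,Another example\n"
)

# Columns to leave out at load time.
UNUSED = {"Source", "Description"}

# usecols accepts a callable that is evaluated against each column name.
df_small = pd.read_csv(
    io.StringIO(csv_text), usecols=lambda name: name not in UNUSED
)

print(df_small.columns.tolist())
```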

In [9]:
df_clean.shape

(500000, 35)

I can see that 11 columns have been dropped, leaving 35.

I will look at how many missing values are in each column before dropping the affected rows.

In [8]:
df_clean.isna().sum()

ID                        0
Severity                  0
Start_Time                0
End_Time                  0
Start_Lat                 0
Start_Lng                 0
Distance(mi)              0
City                     19
County                    0
State                     0
Timezone                507
Airport_Code           1446
Temperature(F)        10466
Wind_Chill(F)        129017
Humidity(%)           11130
Pressure(in)           8928
Visibility(mi)        11291
Wind_Direction        11197
Wind_Speed(mph)       36987
Precipitation(in)    142616
Weather_Condition     11101
Amenity                   0
Bump                      0
Crossing                  0
Give_Way                  0
Junction                  0
No_Exit                   0
Railway                   0
Roundabout                0
Station                   0
Stop                      0
Traffic_Calming           0
Traffic_Signal            0
Turning_Loop              0
Sunrise_Sunset         1483
dtype: int64

I can see that I have missing values for "City", "Timezone", "Airport_Code", "Sunrise_Sunset" and all weather-related variables. I will drop all rows with missing data.
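
To put the counts above in proportion, "Precipitation(in)" is missing in 142,616 of 500,000 rows (about 28.5%) and "Wind_Chill(F)" in 129,017 rows (about 25.8%). A minimal sketch of how the percentage per column and the number of fully complete rows could be computed, using a small made-up DataFrame (toy) rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "City": ["Zachary", None, "Lompoc", "Austin"],
    "Temperature(F)": [77.0, 45.0, np.nan, 27.0],
    "Precipitation(in)": [0.0, np.nan, np.nan, 0.0],
})

# Share of missing values per column, as a percentage, largest first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

# Number of rows with no missing value in any column;
# this is how many rows dropna() would keep.
complete_rows = int(toy.notna().all(axis=1).sum())

print(missing_pct)
print(complete_rows)
```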

In [10]:
df_clean = df_clean.dropna()

print(f"After dropping missing values: {df_clean.shape}")

After dropping missing values: (337617, 35)


I also noted that the term "N/A Precipitation" appears in "Weather_Condition" as a placeholder for missing data. I will check how many rows contain this placeholder before removing them.

In [11]:
df_clean["Weather_Condition"].value_counts().get("N/A Precipitation", 0)

187

In [12]:
df_clean = df_clean[df_clean["Weather_Condition"] != "N/A Precipitation"]
df_clean["Weather_Condition"].value_counts().get("N/A Precipitation", 0)

0
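
Another way to handle such placeholders, sketched here on made-up data, is to convert them to NaN first so that a single dropna() call removes them together with the genuinely missing values. The list name PLACEHOLDERS is hypothetical; only "N/A Precipitation" was observed in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Weather_Condition": ["Fair", "N/A Precipitation", None, "Light Rain"],
    "Temperature(F)": [77.0, 45.0, 68.0, np.nan],
})

PLACEHOLDERS = ["N/A Precipitation"]

# mask() replaces values with NaN wherever the condition is True.
toy["Weather_Condition"] = toy["Weather_Condition"].mask(
    toy["Weather_Condition"].isin(PLACEHOLDERS)
)

# Placeholder rows are now dropped along with rows that had real gaps.
cleaned = toy.dropna()

print(cleaned)
```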

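Before checking for duplicated rows, I will confirm that all "ID" values are unique and then drop this column. A unique identifier makes every row distinct, so duplicates in the remaining columns would go undetected while "ID" is present.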

In [13]:
df_clean["ID"].is_unique

True

In [14]:
df_clean = df_clean.drop(columns=["ID"])
df_clean.head()

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,Distance(mi),City,County,State,Timezone,Airport_Code,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset
0,2,2019-06-12 10:10:56,2019-06-12 10:55:58,30.641211,-91.153481,0.0,Zachary,East Baton Rouge,LA,US/Central,KBTR,77.0,77.0,62.0,29.92,10.0,NW,5.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
1,2,2022-12-03 23:37:14.000000000,2022-12-04 01:56:53.000000000,38.990562,-77.39907,0.056,Sterling,Loudoun,VA,US/Eastern,KIAD,45.0,43.0,48.0,29.91,10.0,W,5.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
2,2,2022-08-20 13:13:00.000000000,2022-08-20 15:22:45.000000000,34.661189,-120.492822,0.022,Lompoc,Santa Barbara,CA,US/Pacific,KLPC,68.0,68.0,73.0,29.79,10.0,W,13.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
3,2,2022-02-21 17:43:04,2022-02-21 19:43:23,43.680592,-92.993317,1.054,Austin,Mower,MN,US/Central,KAUM,27.0,15.0,86.0,28.49,10.0,ENE,15.0,0.0,Wintry Mix,False,False,False,False,False,False,False,False,False,False,False,False,False,Day
4,2,2020-12-04 01:46:00,2020-12-04 04:13:09,35.395484,-118.985176,0.046,Bakersfield,Kern,CA,US/Pacific,KBFL,42.0,42.0,34.0,29.77,10.0,CALM,0.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
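
The effect of a unique identifier on duplicate detection can be illustrated with a small made-up DataFrame. The subset variant at the end is an alternative that keeps "ID" in the data and is not what this notebook does.

```python
import pandas as pd

# Made-up rows: the first two differ only in "ID".
toy = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3"],
    "Severity": [2, 2, 3],
    "City": ["Houston", "Houston", "Quincy"],
})

# With the unique "ID" column present, no row counts as a duplicate.
with_id = int(toy.duplicated().sum())

# After dropping "ID", the second row is flagged as a repeat of the first.
without_id = int(toy.drop(columns=["ID"]).duplicated().sum())

# Alternative: keep "ID" but compare only the other columns.
other_cols = [col for col in toy.columns if col != "ID"]
via_subset = int(toy.duplicated(subset=other_cols).sum())

print(with_id, without_id, via_subset)
```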


Now I will check for duplicated rows.

In [15]:
df_clean[df_clean.duplicated()]

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,Distance(mi),City,County,State,Timezone,Airport_Code,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset
45814,2,2022-01-05 05:35:58,2022-01-05 07:05:58,38.842436,-77.007942,0.936,Washington,District of Columbia,DC,US/Eastern,KDCA,32.0,24.0,79.0,30.09,10.0,S,10.0,0.0,Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
51246,1,2020-05-03 21:13:00,2020-05-04 01:00:00,34.773870,-79.329460,0.000,Maxton,Robeson,NC,US/Eastern,KMEB,78.0,78.0,46.0,29.58,10.0,SSW,10.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
52663,2,2020-12-06 19:06:22,2020-12-06 22:27:25,29.766078,-95.277722,0.222,Houston,Harris,TX,US/Central,KHOU,58.0,58.0,53.0,30.01,10.0,WNW,7.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
84440,3,2020-04-14 15:19:27,2020-04-14 16:04:27,40.833020,-73.146390,0.000,Nesconset,Suffolk,NY,US/Eastern,KISP,54.0,54.0,34.0,29.90,10.0,WNW,5.0,0.0,Mostly Cloudy,False,False,True,False,False,False,False,False,False,False,False,False,False,Day
84491,2,2020-12-21 21:55:30,2020-12-21 23:33:00,38.802161,-77.510604,0.692,Manassas,Prince William,VA,US/Eastern,KHEF,40.0,40.0,97.0,29.53,10.0,CALM,0.0,0.0,Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
496472,2,2021-08-12 14:03:00,2021-08-12 15:46:12,39.534239,-121.575122,0.089,Oroville,Butte,CA,US/Pacific,KOVE,91.0,91.0,39.0,29.69,10.0,W,6.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Day
497673,3,2020-04-29 16:30:52,2020-04-29 16:45:52,42.257320,-71.011320,0.000,Quincy,Norfolk,MA,US/Eastern,KMQE,54.0,54.0,37.0,30.26,10.0,E,12.0,0.0,Cloudy,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
498399,2,2020-10-31 02:57:00,2020-10-31 05:52:19,25.528310,-80.392874,1.009,Homestead,Miami-Dade,FL,US/Eastern,KHST,78.0,78.0,96.0,29.98,10.0,NNW,3.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
498530,2,2020-12-19 05:38:00,2020-12-19 09:45:00,34.955238,-78.757721,0.305,Fayetteville,Cumberland,NC,US/Eastern,KFAY,29.0,29.0,82.0,30.22,10.0,CALM,0.0,0.0,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night


I will remove these duplicates.

In [16]:
df_clean = df_clean.drop_duplicates()
df_clean.shape

(336972, 34)
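
The row count fell from 337,430 (337,617 after dropping missing values, minus the 187 placeholder rows) to 336,972, so 458 duplicated rows were removed. A minimal sketch, on made-up data, of how the number of removed rows could be counted in advance and how every copy of a duplicated row could be viewed:

```python
import pandas as pd

# Made-up rows: "Houston" with Severity 2 appears three times.
toy = pd.DataFrame({
    "Severity": [2, 2, 3, 2],
    "City": ["Houston", "Houston", "Quincy", "Houston"],
})

# duplicated() marks every repeat after the first occurrence,
# so this is the number of rows drop_duplicates() will remove.
n_extra = int(toy.duplicated().sum())

# keep=False marks all copies, including the first occurrence.
all_copies = toy[toy.duplicated(keep=False)]

deduped = toy.drop_duplicates()

print(n_extra, len(all_copies), len(deduped))
```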

I will do a final check of the DataFrame before saving df_clean as a CSV file.

In [17]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 336972 entries, 0 to 499999
Data columns (total 34 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Severity           336972 non-null  int64  
 1   Start_Time         336972 non-null  object 
 2   End_Time           336972 non-null  object 
 3   Start_Lat          336972 non-null  float64
 4   Start_Lng          336972 non-null  float64
 5   Distance(mi)       336972 non-null  float64
 6   City               336972 non-null  object 
 7   County             336972 non-null  object 
 8   State              336972 non-null  object 
 9   Timezone           336972 non-null  object 
 10  Airport_Code       336972 non-null  object 
 11  Temperature(F)     336972 non-null  float64
 12  Wind_Chill(F)      336972 non-null  float64
 13  Humidity(%)        336972 non-null  float64
 14  Pressure(in)       336972 non-null  float64
 15  Visibility(mi)     336972 non-null  float64
 16  Wind_Di

---

## Save to CSV

df_clean will be saved to a CSV file and used for subsampling in the next step.

In [18]:
df_clean.to_csv("Data/Cleaned/US_Accidents_For_Subsampling.csv", index=False)
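
to_csv() raises an error if the target folder does not exist. The sketch below creates the folder first and then reads the file back as a quick check. It uses a temporary directory and made-up data so that it can run anywhere; in the notebook the folder would be "Data/Cleaned".

```python
import os
import tempfile

import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Severity": [2, 3],
    "City": ["Houston", "Quincy"],
    "Traffic_Signal": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "Data", "Cleaned")
    # exist_ok=True makes this safe to re-run when the folder already exists.
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "US_Accidents_For_Subsampling.csv")
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm the shape and column names survived.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```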

---

## Conclusion and Next Steps

* The dataset has been cleaned of missing values and duplicated rows
* Data types have been checked
* The data is ready for subsampling to derive a dataset of ~10,000 instances with balanced classes
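
The subsampling itself belongs to the next notebook. Purely as an illustration of what balanced classes could mean, the sketch below draws the same number of rows from each "Severity" class in a made-up DataFrame. Treating "Severity" as the class label and sampling down to the smallest class are assumptions, not a description of the method actually used.

```python
import pandas as pd

# Made-up rows with unequal class sizes (2, 4 and 3 rows).
toy = pd.DataFrame({
    "Severity": [1, 1, 2, 2, 2, 2, 3, 3, 3],
    "Distance(mi)": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# Size of the smallest class sets how many rows each class can contribute.
per_class = int(toy["Severity"].value_counts().min())

# Draw the same number of rows from every class; random_state makes it repeatable.
balanced = toy.groupby("Severity").sample(n=per_class, random_state=42)

print(balanced["Severity"].value_counts())
```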