# **Subsampling the Dataset**

## Objectives

* To subsample the data to obtain a dataset of ~10,000 instances with balanced classes of the target

## Inputs

* The cleaned dataset, "US_Accidents_For_Subsampling.csv", which is saved locally in "Data/Cleaned"

## Outputs

* The subsampled csv file "US_Accidents_For_Feature_Eng.csv"

## Steps Carried Out

* Change the working directory to the project root
* Load the cleaned dataset with Pandas
* Convert "Start_Time" and "End_Time" to datetime and derive "Clearance_Time(hr)"



---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project'
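
Because the cell above moves up one folder every time it runs, re-running it would leave the project root. A minimal sketch of a guarded version follows. It assumes the notebooks folder is named "jupyter_notebooks", as in the path shown above; `project_root` is a hypothetical helper, not part of the original notebook.

```python
import os
from pathlib import Path


def project_root(cwd, notebooks_folder="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd unchanged.

    The folder name is an assumption taken from the path printed above.
    """
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebooks_folder else cwd


# Safe to re-run: the directory only changes while we are still inside
# the notebooks folder.
os.chdir(project_root(os.getcwd()))
```

With this guard, running the cell a second time leaves the working directory where it is.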

---

## Required Libraries

In [4]:
import pandas as pd
import numpy as np

---

## Load the Dataset

I will load the cleaned dataset using Pandas.

In [6]:
df = pd.read_csv("Data/Cleaned/US_Accidents_For_Subsampling.csv")
pd.set_option("display.max_columns", None)
df

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,Distance(mi),City,County,State,Timezone,Airport_Code,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset
0,2,2019-06-12 10:10:56,2019-06-12 10:55:58,30.641211,-91.153481,0.000,Zachary,East Baton Rouge,LA,US/Central,KBTR,77.0,77.0,62.0,29.92,10.0,NW,5.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
1,2,2022-12-03 23:37:14.000000000,2022-12-04 01:56:53.000000000,38.990562,-77.399070,0.056,Sterling,Loudoun,VA,US/Eastern,KIAD,45.0,43.0,48.0,29.91,10.0,W,5.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
2,2,2022-08-20 13:13:00.000000000,2022-08-20 15:22:45.000000000,34.661189,-120.492822,0.022,Lompoc,Santa Barbara,CA,US/Pacific,KLPC,68.0,68.0,73.0,29.79,10.0,W,13.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
3,2,2022-02-21 17:43:04,2022-02-21 19:43:23,43.680592,-92.993317,1.054,Austin,Mower,MN,US/Central,KAUM,27.0,15.0,86.0,28.49,10.0,ENE,15.0,0.00,Wintry Mix,False,False,False,False,False,False,False,False,False,False,False,False,False,Day
4,2,2020-12-04 01:46:00,2020-12-04 04:13:09,35.395484,-118.985176,0.046,Bakersfield,Kern,CA,US/Pacific,KBFL,42.0,42.0,34.0,29.77,10.0,CALM,0.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336967,2,2021-12-15 07:30:00,2021-12-15 07:50:30,45.522510,-123.084104,0.158,Forest Grove,Washington,OR,US/Pacific,KHIO,40.0,32.0,77.0,29.55,10.0,SSE,15.0,0.01,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Night
336968,2,2021-12-19 16:25:00,2021-12-19 17:40:37,26.702570,-80.111169,0.040,West Palm Beach,Palm Beach,FL,US/Eastern,KPBI,78.0,78.0,87.0,29.94,10.0,SSE,13.0,0.01,Partly Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Day
336969,2,2022-04-13 19:28:29,2022-04-13 21:33:44,34.561862,-112.259620,0.549,Dewey,Yavapai,AZ,US/Mountain,KPRC,52.0,52.0,12.0,24.94,10.0,WSW,12.0,0.00,Fair,False,False,True,True,False,False,False,False,False,False,False,True,False,Night
336970,3,2020-05-15 17:20:56,2020-05-15 17:50:56,38.406680,-78.619310,0.000,Elkton,Rockingham,VA,US/Eastern,KSHD,82.0,82.0,38.0,28.70,10.0,SSW,14.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,True,False,Day
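
The "Unnamed: 0" column in the output above typically appears when a DataFrame was saved to CSV together with its index and then read back with default settings. Whether that is what happened here is an assumption. The sketch below reproduces the effect on a toy DataFrame (the toy data is illustrative only) and shows how `index_col=0` avoids the extra column.

```python
import io

import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({"Severity": [2, 3], "State": ["LA", "VA"]})

# to_csv writes the index as a first column with an empty header by default.
buffer = io.StringIO()
toy.to_csv(buffer)

# Default read: the header-less index column comes back as "Unnamed: 0".
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading with index_col=0 treats that first column as the index instead.
buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Alternatively, saving with `to_csv(..., index=False)` in the cleaning notebook would prevent the column from being written in the first place.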


---

## Create Target Column

The target for my future machine learning model will be "Clearance_Class". This is a column I will derive from the variables "Start_Time" and "End_Time", which currently have the "object" data type. First, I will convert both columns to "datetime", then derive the column "Clearance_Time(hr)" by subtracting "Start_Time" from "End_Time" and converting the result to hours. Then I will derive "Clearance_Class" from "Clearance_Time(hr)".

To derive "Clearance_Time(hr)":

In [7]:
df["Start_Time"] = pd.to_datetime(df["Start_Time"], format="mixed")
df["End_Time"]   = pd.to_datetime(df["End_Time"], format="mixed")

df["Clearance_Time(hr)"] = (df["End_Time"] - df["Start_Time"]).dt.total_seconds() / 3600

df["Clearance_Time(hr)"]

0         0.750556
1         2.327500
2         2.162500
3         2.005278
4         2.452500
            ...   
336967    0.341667
336968    1.260278
336969    2.087500
336970    0.500000
336971    1.443056
Name: Clearance_Time(hr), Length: 336972, dtype: float64
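
As a sanity check, the same calculation can be reproduced on the first two rows shown in the loaded dataset, which use two different timestamp formats. This is a minimal sketch and assumes pandas 2.0 or later, since `format="mixed"` is not available in earlier versions. The `toy` and `non_positive` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# First two rows of the dataset output above: one plain timestamp and one
# with nanosecond padding, which is why format="mixed" is used.
toy = pd.DataFrame(
    {
        "Start_Time": ["2019-06-12 10:10:56", "2022-12-03 23:37:14.000000000"],
        "End_Time": ["2019-06-12 10:55:58", "2022-12-04 01:56:53.000000000"],
    }
)

for col in ("Start_Time", "End_Time"):
    toy[col] = pd.to_datetime(toy[col], format="mixed")

# End minus start gives a timedelta; convert seconds to hours.
toy["Clearance_Time(hr)"] = (
    toy["End_Time"] - toy["Start_Time"]
).dt.total_seconds() / 3600

# Rows where the accident appears to end at or before its start would
# indicate a data problem worth inspecting before deriving the target.
non_positive = toy[toy["Clearance_Time(hr)"] <= 0]

print(toy["Clearance_Time(hr)"].round(6).tolist())
print(len(non_positive))
```

The rounded values match rows 0 and 1 of the output above (0.750556 and 2.327500).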

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should be run top-down. Do not create a workflow in which, at a given point, you need to go back to a previous cell to execute a task, such as refreshing a variable's content.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.