# **Feature Engineering**

## Objectives

* Engineer features for future ML tasks

## Inputs

* The data file, "US_Accidents_For_Feature_Eng.csv", which is locally saved in "Data/Feature_Eng"

## Outputs

* The csv file, "US_Accidents_For_ML.csv", which is locally saved in "Data/ML"

## Summary of Steps

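* Change the working directory to the project root.
* Load the dataset with pandas.
* Review the categorical variables: strip "US/" from "Timezone", check "Sunrise_Sunset", standardize "Wind_Direction" and add "Weather_Simplified".
* Inspect the boolean variables for candidates to drop.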



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project'
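Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above). The helper name `project_root` is hypothetical:

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dirname: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` only if `cwd` is the notebooks folder."""
    # Hypothetical helper: the path is left unchanged when we are already
    # outside the notebooks folder, so repeated calls are harmless.
    return cwd.parent if cwd.name == notebooks_dirname else cwd


# In the notebook this could be used as:
#     os.chdir(project_root(Path.cwd()))
root = project_root(Path("US_Accidents_ML_Project") / "jupyter_notebooks")
again = project_root(root)  # a second call does not move further up
print(root, again)
```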

---

## Required Libraries

In [4]:
import pandas as pd
import numpy as np

---

## Load the Dataset

I load the dataset using Pandas.

In [5]:
df = pd.read_csv("Data/Feature_Eng/US_Accidents_For_Feature_Eng.csv")
pd.set_option("display.max_columns", None)
df

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,Distance(mi),City,County,State,Timezone,Airport_Code,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Clearance_Time(hr),Clearance_Class
0,2,2022-09-08 20:54:00,2022-09-09 23:06:21,32.456486,-93.774536,0.501,Shreveport,Caddo,LA,US/Central,KSHV,78.0,78.0,62.0,29.61,10.00,CALM,0.0,0.00,Fair,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,26.205833,Very Long
1,2,2021-05-22 00:40:30,2021-05-25 09:56:58,36.804693,-76.189728,0.253,Virginia Beach,Virginia Beach,VA,US/Eastern,KORF,54.0,54.0,90.0,30.40,7.00,CALM,0.0,0.00,Fair,False,False,True,False,False,False,False,False,False,False,False,True,False,Night,81.274444,Very Long
2,2,2022-01-21 14:25:00,2023-01-21 16:10:00,29.895741,-90.090026,1.154,Marrero,Jefferson,LA,US/Pacific,KAUD,40.0,33.0,58.0,30.28,10.00,N,10.0,0.00,Mostly Cloudy,False,False,False,False,True,False,False,False,False,False,False,False,False,Day,8761.750000,Very Long
3,2,2020-11-27 00:44:00,2020-11-28 04:49:48,32.456459,-93.779709,0.016,Shreveport,Caddo,LA,US/Central,KSHV,62.0,62.0,75.0,29.80,10.00,SSE,8.0,0.00,Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,28.096667,Very Long
4,2,2020-09-21 12:07:00,2020-09-22 15:22:36,26.966433,-82.255414,0.057,Port Charlotte,Charlotte,FL,US/Eastern,KPGD,84.0,84.0,69.0,29.99,10.00,E,18.0,0.00,Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,27.260000,Very Long
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,2,2021-06-10 00:18:00,2021-06-10 10:53:16,39.573795,-86.618947,4.314,Stilesville,Morgan,IN,US/Eastern,KIND,72.0,72.0,91.0,29.14,10.00,SE,5.0,0.00,Mostly Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,10.587778,Long
9996,2,2020-12-16 14:49:30,2020-12-16 22:48:00,40.001124,-75.342886,0.634,Bryn Mawr,Delaware,PA,US/Eastern,KLOM,26.0,15.0,92.0,29.81,0.75,ENE,13.0,0.00,Snow,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,7.975000,Long
9997,2,2022-12-26 14:44:37,2022-12-27 00:16:30,34.988932,-85.493085,1.560,Guild,Marion,TN,US/Eastern,KCHA,33.0,33.0,35.0,29.51,10.00,SSW,3.0,0.00,Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,9.531389,Long
9998,2,2020-11-14 18:44:30,2020-11-15 03:54:02,40.773530,-74.034951,0.686,Union City,Hudson,NJ,US/Eastern,KNYC,49.0,49.0,41.0,30.13,10.00,CALM,0.0,0.00,Fair,False,False,True,False,False,False,False,False,True,True,False,True,False,Night,9.158889,Long
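The output above carries an `Unnamed: 0` column, which is typically a saved index, and `Start_Time` and `End_Time` are read as plain strings. A minimal sketch of one way to handle both at load time, assuming the first column really is just the old index. The two-row in-memory sample stands in for the real file:

```python
import io

import pandas as pd

# Two illustrative rows standing in for the real CSV file.
sample_csv = io.StringIO(
    ",Severity,Start_Time,End_Time\n"
    "0,2,2022-09-08 20:54:00,2022-09-09 23:06:21\n"
    "1,2,2021-05-22 00:40:30,2021-05-25 09:56:58\n"
)

sample = pd.read_csv(
    sample_csv,
    index_col=0,  # use the saved index instead of creating "Unnamed: 0"
    parse_dates=["Start_Time", "End_Time"],  # read timestamps as datetimes
)
print(sample.dtypes)
```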


---

## Detailed Look at Categorical Variables

In this section, I'm going to have a detailed look at the categorical variables and decide which ones to modify, drop or carry forward. For numerical variables, I will similarly decide which variables to modify, drop or carry forward after EDA.

In [11]:
for col in df.select_dtypes(include="object").columns:
    print(f"{col}: {df[col].nunique()}")

Start_Time: 9791
End_Time: 9920
City: 2433
County: 754
State: 48
Timezone: 4
Airport_Code: 993
Wind_Direction: 23
Weather_Condition: 51
Sunrise_Sunset: 2
Clearance_Class: 4


First, I would like to strip "US/" from each value under "Timezone".

In [7]:
df["Timezone"].unique()

array(['US/Central', 'US/Eastern', 'US/Pacific', 'US/Mountain'],
      dtype=object)

In [10]:
df["Timezone"] = df["Timezone"].str.replace("US/", "", regex=False)
df["Timezone"].value_counts()

Timezone
Eastern     4298
Pacific     2521
Central     2447
Mountain     734
Name: count, dtype: int64
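`str.replace("US/", "")` removes the substring wherever it occurs in a value. If only a leading prefix should be stripped, `Series.str.removeprefix` (available from pandas 1.4) is a possible alternative. A small sketch on the four values seen above:

```python
import pandas as pd

timezones = pd.Series(["US/Central", "US/Eastern", "US/Pacific", "US/Mountain"])

# removeprefix only touches a leading "US/"; other values pass through unchanged.
stripped = timezones.str.removeprefix("US/")
print(stripped.tolist())
```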

Next, I will take a look at the values in "Sunrise_Sunset" and ensure they are valid.

In [9]:
df["Sunrise_Sunset"].value_counts()

Sunrise_Sunset
Day      6445
Night    3555
Name: count, dtype: int64

Similarly, I will take a look at values for "Wind_Direction".

In [12]:
df["Wind_Direction"].value_counts()

Wind_Direction
CALM        1770
S            799
N            628
E            595
W            574
SSE          515
NW           478
VAR          473
WNW          453
SW           448
SSW          448
NNW          429
WSW          427
SE           424
ENE          419
NNE          363
ESE          362
NE           359
North         14
East           6
Variable       6
South          5
West           5
Name: count, dtype: int64

I can see that some values are the same direction recorded in two forms, shorthand and longhand, for example "S" and "South". I will create a map to convert the longhand values to their shorthand equivalents.

In [13]:
# define mapping
direction_map = {
    "Variable": "VAR",
    "South": "S",
    "North": "N",
    "West": "W",
    "East": "E"
}

# apply mapping
df["Wind_Direction"] = df["Wind_Direction"].replace(direction_map)

# check unique values again
df["Wind_Direction"].unique()

array(['CALM', 'N', 'SSE', 'E', 'S', 'WNW', 'NNW', 'SE', 'ENE', 'NW',
       'NE', 'WSW', 'NNE', 'SW', 'W', 'ESE', 'VAR', 'SSW'], dtype=object)
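After the mapping, the remaining values are the 16 compass points plus "CALM" and "VAR". A minimal sketch of a check that would flag anything outside that set, shown on a few sample values rather than the full column. The name `VALID_DIRECTIONS` is an assumption for illustration:

```python
import pandas as pd

# 16 compass points plus the two special codes seen in the data.
VALID_DIRECTIONS = {
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
    "CALM", "VAR",
}

direction_map = {"Variable": "VAR", "South": "S", "North": "N", "West": "W", "East": "E"}

# A few sample values, including longhand forms that the map converts.
wind = pd.Series(["CALM", "South", "NNW", "Variable", "E"]).replace(direction_map)

# Any value outside the expected set would show up here.
unexpected = set(wind.dropna()) - VALID_DIRECTIONS
print(unexpected)
```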

Next, I will look at values for "Weather_Condition".

In [14]:
df["Weather_Condition"].value_counts()

Weather_Condition
Fair                       4718
Cloudy                     1598
Mostly Cloudy              1308
Partly Cloudy               808
Light Rain                  453
Light Snow                  199
Fog                         160
Rain                        119
Haze                         83
Fair / Windy                 65
Cloudy / Windy               45
Heavy Rain                   41
Mostly Cloudy / Windy        37
Snow                         32
Thunder in the Vicinity      32
Smoke                        29
Overcast                     28
Light Drizzle                27
Thunder                      24
Wintry Mix                   24
T-Storm                      18
Light Rain / Windy           17
Partly Cloudy / Windy        16
Light Rain with Thunder      14
Light Snow / Windy           13
Heavy T-Storm                11
Heavy Snow                    9
Shallow Fog                   6
Light Freezing Rain           5
T-Storm / Windy               5
Mist                  

I am going to create a new column, "Weather_Simplified", to reduce the number of distinct weather conditions. I will create a csv file that maps each "Weather_Condition" to a "Weather_Simplified" value and then merge "Weather_Simplified" into the DataFrame as a new column.

"Weather_Simplified" is constructed so that all types of rain, snow, fog etc. are grouped together. Mixed weather conditions consistently took the form "Condition / Windy". These are simplified to "Condition" with "Windy" dropped, unless the condition is "Fair", "Cloudy" or "Mostly Cloudy", in which case they are simplified to "Windy". This records the 'worst' of the mixed conditions, which is a subjective judgment and should be reviewed with the client.

In [16]:
# Get unique weather conditions (before simplification)
unique_conditions = df['Weather_Condition'].dropna().unique()

# Convert to a DataFrame for better Excel paste
unique_df = pd.DataFrame(unique_conditions, columns=['Weather_Condition'])

# Copy to clipboard
unique_df.to_clipboard(index=False)  # No index column
print("Copied to clipboard! You can now paste into Excel.")

Copied to clipboard! You can now paste into Excel.
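`to_clipboard` relies on a system clipboard being available, which may not be the case on a remote or headless machine. A minimal sketch of a file-based alternative, assuming a blank template csv is an acceptable starting point for the mapping. The file name and the temporary folder are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

conditions = pd.Series(["Fair", "Cloudy", "Light Rain", "Fair", None])

# One row per distinct condition, with an empty column to fill in by hand.
template = pd.DataFrame({"Weather_Condition": sorted(conditions.dropna().unique())})
template["Weather_Simplified"] = ""

# Hypothetical file name, written to a temporary folder for this sketch.
out_path = Path(tempfile.mkdtemp()) / "Weather_Condition_Map_template.csv"
template.to_csv(out_path, index=False)
print(out_path.read_text())
```

Writing with `index=False` also keeps an extra `Unnamed: 0` column from appearing when the file is read back.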


In [17]:
weather_map = pd.read_csv("Data/Raw/Supporting_files/Weather_Condition_Map.csv")
weather_map.head()

Unnamed: 0,Weather_Condition,Weather_Simplified
0,Fair,Fair
1,Mostly Cloudy,Cloudy
2,Cloudy,Cloudy
3,Partly Cloudy,Rain
4,Light Rain,Rain
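The rule for mixed "Condition / Windy" values described earlier can be sketched as a small function. This is an illustration only, not the logic used to build the mapping file. It assumes the separator is exactly " / ", and the names `resolve_windy` and `MILD_CONDITIONS` are hypothetical:

```python
# Conditions for which "Windy" is treated as the worse of the pair,
# as listed in the text.
MILD_CONDITIONS = {"Fair", "Cloudy", "Mostly Cloudy"}


def resolve_windy(condition: str) -> str:
    """Apply the mixed-condition rule to a single Weather_Condition value."""
    base, sep, suffix = condition.partition(" / ")
    if sep and suffix == "Windy":
        # Keep "Windy" only when the base condition is mild.
        return "Windy" if base in MILD_CONDITIONS else base
    return condition


for value in ["Fair / Windy", "Light Rain / Windy", "Fog"]:
    print(value, "->", resolve_windy(value))
```

The function only resolves the "Windy" suffix. Grouping the result further (for example "Light Rain" into "Rain") is a separate step that the mapping file handles.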


In [18]:
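# merge only the mapping columns so the map's saved index ("Unnamed: 0") is not carried over
df = df.merge(weather_map[["Weather_Condition", "Weather_Simplified"]], on="Weather_Condition", how="left")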

# check if any Weather_Condition values didn't get mapped
unmapped = df[df["Weather_Simplified"].isna()]["Weather_Condition"].unique()

if len(unmapped) > 0:
    print("Warning: The following Weather_Condition values were not mapped:")
    print(unmapped)
else:
    print("All Weather_Condition values successfully mapped.")

All Weather_Condition values successfully mapped.
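The same check can also be expressed with two options of `DataFrame.merge`. A minimal sketch on toy data, assuming the mapping should hold exactly one row per "Weather_Condition". The frames `accidents` and `mapping` are made up for illustration:

```python
import pandas as pd

accidents = pd.DataFrame({"Weather_Condition": ["Fair", "Light Rain", "Fair", "Smoke"]})
mapping = pd.DataFrame(
    {
        "Weather_Condition": ["Fair", "Light Rain"],
        "Weather_Simplified": ["Fair", "Rain"],
    }
)

merged = accidents.merge(
    mapping,
    on="Weather_Condition",
    how="left",
    validate="many_to_one",  # raises MergeError if the map has duplicate keys
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Rows flagged "left_only" had no match in the mapping.
unmapped = merged.loc[merged["_merge"] == "left_only", "Weather_Condition"].unique()
print(unmapped)
```

`validate="many_to_one"` guards against duplicate keys in the map, which would otherwise silently duplicate rows in the merged DataFrame.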


Lastly, I will look at the boolean variables and see if any should be dropped.

In [6]:
for col in df.select_dtypes(include="boolean").columns:
    print(f"{col}: {df[col].nunique()}")
    print(f"{df[col].unique()}")

Amenity: 2
[False  True]
Bump: 2
[False  True]
Crossing: 2
[False  True]
Give_Way: 2
[False  True]
Junction: 2
[False  True]
No_Exit: 2
[False  True]
Railway: 2
[False  True]
Roundabout: 1
[False]
Station: 2
[False  True]
Stop: 2
[False  True]
Traffic_Calming: 2
[False  True]
Traffic_Signal: 2
[False  True]
Turning_Loop: 1
[False]
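"Roundabout" and "Turning_Loop" take a single value in the output above, so they carry no information in this dataset. A minimal sketch of how constant columns could be found and dropped, shown on toy data. Whether to drop them is a judgment call that depends on whether future data might contain other values:

```python
import pandas as pd

flags = pd.DataFrame(
    {
        "Crossing": [False, True, False],
        "Roundabout": [False, False, False],
        "Turning_Loop": [False, False, False],
    }
)

# Columns with a single distinct value cannot help a model separate rows.
constant_cols = [col for col in flags.columns if flags[col].nunique(dropna=False) <= 1]
reduced = flags.drop(columns=constant_cols)
print(constant_cols)
```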


---

## Conclusion and Next Steps

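* "Timezone" values no longer carry the "US/" prefix.
* "Wind_Direction" now uses the shorthand codes only.
* "Weather_Simplified" has been added, with every "Weather_Condition" value mapped.
* "Roundabout" and "Turning_Loop" contain only False in this dataset, so they are candidates to drop.
* Next, decide which numerical variables to modify, drop or carry forward after EDA, then save the result as "US_Accidents_For_ML.csv" in "Data/ML".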