# Chicago Car Crashes Project

## Business Understanding

### Vehicle crash data serves several purposes: improving public safety, informing urban planning, and supporting policy making.

### Some of the business questions that can be derived from the data include:

#### 1. What factors contribute most to severe crashes?
#### 2. Which locations, times, and conditions are accident prone?
#### 3. Are certain groups more vulnerable to crashes?
#### 4. Which vehicle types are most involved in severe or fatal crashes?
#### 5. How do behaviors such as seatbelt use, intoxication, or distraction affect injury severity?
#### 6. How can the data help the police, hospitals, and city planners target interventions?

## Problem Statement
#### Pinpoint crash hotspots in Chicago and understand the contributing factors, to assist city planners and law enforcement in minimizing accidents.

## Metric for Success

### Successfully answering the above business questions will be a significant advantage.
### Another metric will be a hotspot analysis that accurately pinpoints high-risk zones for crashes.

## Real World Use Case
### The results could help governments and city planners decide where to install cameras, improve lighting, and redesign roads in hotspots.


### Considering that lives are involved and this is my first official model, an accuracy of 80% will be considered sufficient.
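
Severe crashes are usually a small share of all records, so accuracy on its own can look good while the model misses the cases that matter most. The minimal sketch below uses made-up labels (the 90/10 split is an assumption for illustration, not a figure from this dataset) and a trivial "model" that always predicts the majority class:

```python
# Hypothetical labels: 1 = severe crash, 0 = not severe (illustrative 90/10 split)
y_true = [0] * 90 + [1] * 10
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

# Share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Share of truly severe cases that the model catches
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Under these assumptions the trivial model clears an 80% accuracy bar while catching none of the severe cases, so the accuracy target is probably best read alongside recall for the severe class.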

## Data Understanding



In [88]:
#import the libraries
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline
import warnings 
warnings.filterwarnings("ignore")

#import sklearn libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, SMOTEN
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report,roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline

In [89]:
# load the datasets
people_data = pd.read_csv("Traffic_Crashes_People.csv", low_memory=False)
vehicles_data = pd.read_csv("Traffic_Crashes_Vehicles.csv", low_memory=False)
crashes_data = pd.read_csv("Traffic_Crashes_Crashes.csv", low_memory=False)

In [90]:
# checking the people dataset
people_data.head()

Unnamed: 0,PERSON_ID,PERSON_TYPE,CRASH_RECORD_ID,VEHICLE_ID,CRASH_DATE,SEAT_NO,CITY,STATE,ZIPCODE,SEX,...,EMS_RUN_NO,DRIVER_ACTION,DRIVER_VISION,PHYSICAL_CONDITION,PEDPEDAL_ACTION,PEDPEDAL_VISIBILITY,PEDPEDAL_LOCATION,BAC_RESULT,BAC_RESULT VALUE,CELL_PHONE_USE
0,O2146003,DRIVER,95be268b1aaa3b632ce13055264f4c8e2c304ba6d8bdae...,2045728.0,09/07/2025 11:30:00 PM,,UNKNOWN,XX,,X,...,,UNKNOWN,UNKNOWN,UNKNOWN,,,,TEST NOT OFFERED,,
1,O2146009,DRIVER,bd1f28c916dfca267392818a032e26431679d1e6572495...,2045733.0,09/07/2025 11:30:00 PM,,MIAMI GARGENS,FL,33056.0,M,...,,FAILED TO YIELD,NOT OBSCURED,NORMAL,,,,TEST NOT OFFERED,,
2,O2146010,DRIVER,bd1f28c916dfca267392818a032e26431679d1e6572495...,2045748.0,09/07/2025 11:30:00 PM,,CHICAGO,IL,60629.0,M,...,,NONE,NOT OBSCURED,NORMAL,,,,TEST NOT OFFERED,,
3,O2146013,DRIVER,a397b6e3872a2695dd20df2318c748fe5e02ba3662b9c0...,2045737.0,09/07/2025 11:26:00 PM,,,,,M,...,,UNKNOWN,UNKNOWN,UNKNOWN,,,,TEST NOT OFFERED,,
4,O2146014,DRIVER,a397b6e3872a2695dd20df2318c748fe5e02ba3662b9c0...,2045745.0,09/07/2025 11:26:00 PM,,WAUSAU,WI,54401.0,M,...,,NONE,NOT OBSCURED,NORMAL,,,,TEST NOT OFFERED,,


In [91]:
# checking the vehicles dataset
vehicles_data.head()

Unnamed: 0,CRASH_UNIT_ID,CRASH_RECORD_ID,CRASH_DATE,UNIT_NO,UNIT_TYPE,NUM_PASSENGERS,VEHICLE_ID,CMRC_VEH_I,MAKE,MODEL,...,TRAILER1_LENGTH,TRAILER2_LENGTH,TOTAL_VEHICLE_LENGTH,AXLE_CNT,VEHICLE_CONFIG,CARGO_BODY_TYPE,LOAD_TYPE,HAZMAT_OUT_OF_SERVICE_I,MCS_OUT_OF_SERVICE_I,HAZMAT_CLASS
0,2146003,95be268b1aaa3b632ce13055264f4c8e2c304ba6d8bdae...,09/07/2025 11:30:00 PM,1,DRIVER,,2045728.0,,UNKNOWN,OTHER (EXPLAIN IN NARRATIVE),...,,,,,,,,,,
1,2146004,95be268b1aaa3b632ce13055264f4c8e2c304ba6d8bdae...,09/07/2025 11:30:00 PM,2,PARKED,,2045729.0,,DODGE,CARAVAN,...,,,,,,,,,,
2,2146009,bd1f28c916dfca267392818a032e26431679d1e6572495...,09/07/2025 11:30:00 PM,1,DRIVER,,2045733.0,,VOLKSWAGEN,JETTA,...,,,,,,,,,,
3,2146010,bd1f28c916dfca267392818a032e26431679d1e6572495...,09/07/2025 11:30:00 PM,2,DRIVER,,2045748.0,,YAMAHA,YAMAHA,...,,,,,,,,,,
4,2146013,a397b6e3872a2695dd20df2318c748fe5e02ba3662b9c0...,09/07/2025 11:26:00 PM,1,DRIVER,,2045737.0,,UNKNOWN,OTHER (EXPLAIN IN NARRATIVE),...,,,,,,,,,,


In [92]:
# checking the crashes dataset
crashes_data.head()

Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,9c1182f668ea7605b7a37aaecdb2350fbc625a475d60b3...,,09/09/2025 11:55:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR TO FRONT,DIVIDED - W/MEDIAN (NOT RAISED),...,0.0,0.0,2.0,0.0,23,3,9,41.735635,-87.68265,POINT (-87.682650342555 41.735635328819)
1,31397733ed36babc51ab6de3c99d7950d90a1439e64ce4...,,09/09/2025 11:30:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,UNKNOWN,PARKED MOTOR VEHICLE,ONE-WAY,...,0.0,0.0,1.0,0.0,23,3,9,41.965614,-87.766979,POINT (-87.766979452073 41.965614275418)
2,4d5883a274332f0b0e71eb5fb4684e0ec3f580374847ae...,,09/09/2025 11:18:00 PM,15,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,PARKING LOT,...,1.0,0.0,0.0,0.0,23,3,9,41.920299,-87.670814,POINT (-87.670814157004 41.920299243516)
3,bcb79b90f48dab8b36a40e50271104b68dc1d1d72425d8...,,09/09/2025 10:11:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",TURNING,NOT DIVIDED,...,0.0,0.0,1.0,0.0,22,3,9,41.917052,-87.755962,POINT (-87.755962192591 41.91705242361)
4,d24363839c4baaf316030a414d5f285c15b21aff799a45...,,09/09/2025 09:51:00 PM,20,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,PARKED MOTOR VEHICLE,NOT DIVIDED,...,0.0,0.0,1.0,0.0,21,3,9,41.846225,-87.722773,POINT (-87.722772758962 41.846224849448)


In [93]:
# checking the shape of all three datasets
print(f"The people dataset has a shape of {people_data.shape}, the vehicles dataset has a shape of {vehicles_data.shape}, the crashes dataset has a shape of {crashes_data.shape}.")

The people dataset has a shape of (2157293, 29), the vehicles dataset has a shape of (2003473, 71), the crashes dataset has a shape of (983027, 48).


In [94]:
# Checking the column names in all three datasets
print(people_data.columns)

print(vehicles_data.columns)

print(crashes_data.columns)

Index(['PERSON_ID', 'PERSON_TYPE', 'CRASH_RECORD_ID', 'VEHICLE_ID',
       'CRASH_DATE', 'SEAT_NO', 'CITY', 'STATE', 'ZIPCODE', 'SEX', 'AGE',
       'DRIVERS_LICENSE_STATE', 'DRIVERS_LICENSE_CLASS', 'SAFETY_EQUIPMENT',
       'AIRBAG_DEPLOYED', 'EJECTION', 'INJURY_CLASSIFICATION', 'HOSPITAL',
       'EMS_AGENCY', 'EMS_RUN_NO', 'DRIVER_ACTION', 'DRIVER_VISION',
       'PHYSICAL_CONDITION', 'PEDPEDAL_ACTION', 'PEDPEDAL_VISIBILITY',
       'PEDPEDAL_LOCATION', 'BAC_RESULT', 'BAC_RESULT VALUE',
       'CELL_PHONE_USE'],
      dtype='object')
Index(['CRASH_UNIT_ID', 'CRASH_RECORD_ID', 'CRASH_DATE', 'UNIT_NO',
       'UNIT_TYPE', 'NUM_PASSENGERS', 'VEHICLE_ID', 'CMRC_VEH_I', 'MAKE',
       'MODEL', 'LIC_PLATE_STATE', 'VEHICLE_YEAR', 'VEHICLE_DEFECT',
       'VEHICLE_TYPE', 'VEHICLE_USE', 'TRAVEL_DIRECTION', 'MANEUVER',
       'TOWED_I', 'FIRE_I', 'OCCUPANT_CNT', 'EXCEED_SPEED_LIMIT_I', 'TOWED_BY',
       'TOWED_TO', 'AREA_00_I', 'AREA_01_I', 'AREA_02_I', 'AREA_03_I',
       'AREA_04_I', 'ARE

#### From the above, the vehicles dataset has many more columns than the people dataset.
#### However, I will still merge the datasets so that I can work from a single dataset.

In [None]:
# The approach for joining the people and vehicles datasets is to join on the crash_record_id and vehicle_id columns, since both datasets share them.
# Joining on both keys helps minimize duplicate rows.

# First, standardise the column names to lowercase to avoid mismatches
people_data.columns = people_data.columns.str.lower()
vehicles_data.columns = vehicles_data.columns.str.lower()
crashes_data.columns = crashes_data.columns.str.lower()

# Next, merge on the crash_record_id and vehicle_id columns.
# Disclaimer: because of the limited computing power on my PC, I decided to go with an inner merge.
people_vehicles = pd.merge(
    people_data,
    vehicles_data,
    how="inner",
    on=["crash_record_id", "vehicle_id"]
)

# Merging the third dataset
merged_df = pd.merge(    
    people_vehicles,
    crashes_data,
    how="inner",
    on="crash_record_id"
)


print(merged_df.shape)
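
An inner merge silently drops rows that have no match, and it multiplies rows if the keys on the right-hand side are not unique. The sketch below is one way to check both on toy frames before running the full merge. All values are made up, including the person with no `vehicle_id`. It relies on the `indicator` and `validate` parameters of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the people and vehicles tables (values are made up)
people = pd.DataFrame({
    "crash_record_id": ["a", "a", "b", "c"],
    "vehicle_id": [1.0, 2.0, 3.0, None],   # hypothetical person with no vehicle
    "person_type": ["DRIVER", "DRIVER", "DRIVER", "PEDESTRIAN"],
})
vehicles = pd.DataFrame({
    "crash_record_id": ["a", "a", "b"],
    "vehicle_id": [1.0, 2.0, 3.0],
    "make": ["DODGE", "YAMAHA", "VOLKSWAGEN"],
})

# An outer merge with indicator=True labels where each row came from;
# validate="many_to_one" raises if the vehicle keys are not unique
check = pd.merge(
    people,
    vehicles,
    how="outer",
    on=["crash_record_id", "vehicle_id"],
    indicator=True,
    validate="many_to_one",
)
print(check["_merge"].value_counts())

# Rows an inner merge would keep
kept = (check["_merge"] == "both").sum()
print(f"{kept} of {len(people)} people rows would survive an inner merge")
```

On the real data the outer merge may be too heavy for a small machine. In that case the same check could be run on a sample of `crash_record_id` values.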

In [None]:
# Checking the first 5 rows of the merged data
merged_df.head()

In [None]:
merged_df.columns

In [None]:
# Checking the percentage of missing data
missing_data = merged_df.isnull().mean()*100

#  Sorting the missing data in descending order
missing_data = missing_data.sort_values(ascending=False)

print(missing_data)

In [None]:
# Now considering the number of rows and columns, I will be dropping the columns with more than 70% missing data. 
# The 70% figure is based on the fact that there is too much missing data for that column to be useful.

# Threshold (70%)
threshold = 0.7

# Drop columns where more than 70% values are missing
merged_df = merged_df.loc[:, merged_df.isnull().mean() < threshold]

# Checking the percentage of missing data
missing_data = merged_df.isnull().mean()*100

#  Sorting the missing data in descending order
missing_data = missing_data.sort_values(ascending=False)

print(missing_data)
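
The thresholding step above can be wrapped in a small helper that also reports which columns were removed, which makes the decision easier to audit. This is a sketch on toy data; `drop_sparse_columns` is a hypothetical name, not part of pandas:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.7):
    """Return df without columns whose missing share is >= threshold, plus the dropped names."""
    missing_share = df.isnull().mean()
    dropped = missing_share[missing_share >= threshold].index.tolist()
    return df.drop(columns=dropped), dropped

# Toy frame (values are made up)
toy = pd.DataFrame({
    "age": [34, np.nan, 51, 27],                      # 25% missing
    "hazmat_class": [np.nan, np.nan, np.nan, "3"],    # 75% missing
    "make": ["DODGE", "YAMAHA", None, "VOLKSWAGEN"],  # 25% missing
})

kept_df, dropped = drop_sparse_columns(toy)
print(dropped)
```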

In [None]:
# Checking the data's new shape
merged_df.shape

In [None]:
merged_df.head()

In [None]:
# Print all column names to check for typos or different names
print(merged_df.columns)

In [None]:
# Now, for the remaining columns, I will use relationships between columns to fill in the missing data

corrs = merged_df.corr(numeric_only=True)["num_passengers"].sort_values(ascending=False)

print(corrs) 

In [None]:
# Since there is a high correlation with occupant_cnt, I will start by filling in that column, which will then be used to fill in the num_passengers column

# Filling in occupant_cnt using mode
mode_value = merged_df["occupant_cnt"].mode()[0]

merged_df["occupant_cnt"] = merged_df["occupant_cnt"].fillna(mode_value)

# check if it took
merged_df["occupant_cnt"].isnull().sum()

In [None]:
# using the occupant_cnt column to fill in num_passengers
# (passengers = occupants minus the driver); fillna with a Series is
# vectorized and much faster than a row-wise apply on a table this size
merged_df["num_passengers"] = merged_df["num_passengers"].fillna(
    merged_df["occupant_cnt"] - 1
)

# check if it took 
merged_df["num_passengers"].isnull().sum() 

In [None]:
# Columns whose missing values will be filled with the mode
columns_to_fill_with_mode = [
    "drivers_license_class", "drivers_license_state", "zipcode", "age","city", "state", "driver_vision", "driver_action", "bac_result",
    "physical_condition", "vehicle_year", "lic_plate_state", "first_contact_point","model", "make", "occupant_cnt", "maneuver", "travel_direction",
    "vehicle_use", "vehicle_type", "vehicle_defect", "vehicle_id","airbag_deployed", "sex", "ejection", "safety_equipment", "injury_classification",
    "unit_type","report_type","location","longitude","latitude","most_severe_injury","beat_of_occurrence","street_direction","street_name" 
]

# Looping through and filling each with the mode 

for col in columns_to_fill_with_mode:
    mode_value = merged_df[col].mode()[0]   # get most common value
    merged_df[col] = merged_df[col].fillna(mode_value)

# Check one of them
print("Missing drivers_license_class:", merged_df["drivers_license_class"].isnull().sum())
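
One caveat with mode imputation concerns `latitude` and `longitude`: every record with missing coordinates is moved onto the single most common point, which may create an artificial hotspot in a spatial analysis. The sketch below illustrates the effect on made-up coordinates and shows one possible alternative, which is to keep only rows with recorded coordinates for the spatial steps:

```python
import numpy as np
import pandas as pd

# Toy coordinates (made up): two crashes share a location, two have none recorded
toy = pd.DataFrame({
    "latitude":  [41.90, 41.90, 41.75, np.nan, np.nan],
    "longitude": [-87.65, -87.65, -87.70, np.nan, np.nan],
})

# Mode imputation moves every missing record onto the most common point
mode_filled = toy.fillna(toy.mode().iloc[0])
print(mode_filled.value_counts().head(1))

# Alternative for spatial work: keep only rows with recorded coordinates
located = toy.dropna(subset=["latitude", "longitude"])
print(len(located))
```

In this toy case the most common point goes from two records to four after imputation, even though only two crashes were actually recorded there.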

In [None]:
# checking for columns that still have missing data
missing_data2 = merged_df.isnull().mean()*100

#  Sorting the missing data in descending order
missing_data2 = missing_data2.sort_values(ascending=False)

print(missing_data2)


In [None]:
#confirm the imputation
merged_df.isna().sum().any()

In [None]:
#check duplicates
merged_df.duplicated().sum()

# Data Preparation


In [None]:
# Making a copy of the cleaned dataset

cleaned_data = merged_df.copy(deep=True)
cleaned_data.shape

In [None]:
cleaned_data.columns

In [None]:
cleaned_data.head()

In [None]:
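# checking for outliers in the numeric columns
sns.boxplot(data=cleaned_data.select_dtypes(include="number"), color="r")
plt.grid(alpha=.3)
plt.xticks(rotation=45)
plt.tight_layout()  # called after rotating the tick labels so they fit in the figure
plt.show();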

In the plot above, vehicle_id and crash_unit_id dominate the scale. They are unique identifiers assigned to each record rather than measurements, so their spread does not indicate outliers.

# Exploratory Data Analysis


In [None]:
# Considering the size of the dataset, the next step is to draw a random 10% sample of the rows to use for analysis
sample_data = cleaned_data.sample(frac=0.1, random_state=42)
sample_data.shape


##### The above creates a sample of about 200K rows, which lets me run the analysis within what my PC can handle.
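
A plain random sample keeps rare categories only in proportion on average, so a small class such as fatal injuries can end up under-represented by chance. One hedged alternative is to sample within each class with `groupby(...).sample(...)`. The sketch below uses made-up data, and the choice of `most_severe_injury` as the stratifying column is an assumption:

```python
import pandas as pd

# Toy data (made up): a rare "FATAL" class among mostly uninjured records
toy = pd.DataFrame({
    "most_severe_injury": ["NO INDICATION OF INJURY"] * 90 + ["FATAL"] * 10,
    "row": range(100),
})

# Sample 10% within each injury class so rare classes keep their share
stratified = toy.groupby("most_severe_injury").sample(frac=0.1, random_state=42)
print(stratified["most_severe_injury"].value_counts())
```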

In [None]:
sample_data.head()

In [None]:
# In line with the problem statement, the EDA focuses on crash hotspots

# Making sure that the 'latitude' and 'longitude' columns have no missing values
df_spatial = sample_data.dropna(subset=['latitude', 'longitude'])

# Keeping only coordinates that fall within the Chicago region
df_spatial = df_spatial[
    (df_spatial['latitude'].between(41.6, 42.1)) &
    (df_spatial['longitude'].between(-87.95, -87.5))
]

print(f"Remaining rows after filtering: {len(df_spatial)}")

In [None]:
# Static density heatmap showing where the crashes are concentrated

plt.figure(figsize=(8, 8))
plt.hexbin(df_spatial['longitude'], df_spatial['latitude'], gridsize=100, cmap='Reds', bins='log')
plt.colorbar(label='log(crash count)')
plt.title("Crash Density Across Chicago")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()

In [None]:
# Hotspot clustering using DBSCAN
from sklearn.cluster import DBSCAN
coords = df_spatial[['latitude','longitude']].to_numpy()

# scaling for DBSCAN
coords_scaled = StandardScaler().fit_transform(coords)

db = DBSCAN(eps=0.05, min_samples=30).fit(coords_scaled)  
df_spatial['cluster'] = db.labels_

print(df_spatial['cluster'].value_counts().head())
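
With standardized coordinates, `eps` is a unitless number that is hard to relate to distance on the ground. A possible alternative, which is not what the cell above does, is DBSCAN with the haversine metric, where `eps` can be stated in kilometres. The sketch below runs on made-up points; the `eps_km` and `min_samples` values are assumptions chosen only for the toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088

# Toy points (made up): two tight groups roughly 10 km apart plus one stray point
coords_deg = np.array([
    [41.8800, -87.6300], [41.8802, -87.6301], [41.8801, -87.6299],
    [41.9700, -87.6600], [41.9701, -87.6602], [41.9699, -87.6601],
    [41.7000, -87.9000],
])

# The haversine metric expects [latitude, longitude] in radians,
# and eps is then an angle: distance in km divided by Earth's radius
eps_km = 0.2
db = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,
    min_samples=3,
    metric="haversine",
    algorithm="ball_tree",
).fit(np.radians(coords_deg))

print(db.labels_)  # -1 marks noise points
```

Stating the radius in kilometres makes it easier to explain a hotspot to planners, for example "at least 30 crashes within 200 m".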

In [None]:
# Ranking the hotspots
hotspots = df_spatial.groupby('cluster').agg(
    crashes=('crash_record_id', 'count'),
    total_injuries=('injuries_total','sum'),
    fatalities=('injuries_fatal','sum')
).sort_values('crashes', ascending=False)

# Checking the order of the hotspots
hotspots.head(10)
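
Because the merged table has one row per person and vehicle, counting rows per cluster counts people rather than crashes, and summing a crash-level column such as `injuries_total` repeats the same value once per person. A minimal sketch of one way to avoid this, assuming `injuries_total` and `injuries_fatal` are crash-level totals, is to collapse to one row per crash first (toy values are made up):

```python
import pandas as pd

# Toy person-level rows (made up): crash "a" involved three people, crash "b" one
toy = pd.DataFrame({
    "crash_record_id": ["a", "a", "a", "b"],
    "cluster": [0, 0, 0, 0],
    "injuries_total": [2.0, 2.0, 2.0, 1.0],   # crash-level value repeated per person
    "injuries_fatal": [0.0, 0.0, 0.0, 0.0],
})

# Collapse to one row per crash before aggregating by cluster
per_crash = toy.drop_duplicates(subset="crash_record_id")
hotspots_toy = per_crash.groupby("cluster").agg(
    crashes=("crash_record_id", "nunique"),
    total_injuries=("injuries_total", "sum"),
    fatalities=("injuries_fatal", "sum"),
)
print(hotspots_toy)
```

Without the de-duplication, the same toy data would report four crashes and seven injuries instead of two and three.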

In [None]:
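# Visualizing the clusters on a map, with crashes color-coded by cluster
import folium
from matplotlib.colors import to_hex

m_clusters = folium.Map(location=[41.85, -87.65], zoom_start=11)

# hash() of a small integer is the integer itself, so hashing the cluster label
# gives near-black colors for every cluster; use a qualitative colormap instead
palette = plt.get_cmap("tab20")

# Skip noise points (cluster == -1) and iterate over plain columns instead of iterrows
clustered = df_spatial[df_spatial['cluster'] != -1]
for lat, lon, cluster in zip(clustered['latitude'], clustered['longitude'], clustered['cluster']):
    folium.CircleMarker(
        location=[lat, lon],
        radius=2,
        color=to_hex(palette(int(cluster) % 20)),  # colors repeat after 20 clusters
        fill=True,
        fill_opacity=0.6
    ).add_to(m_clusters)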

m_clusters

# Univariate Analysis

In [None]:
# The analysis below is limited to the variables relevant to the hotspot problem instead of brute-forcing
# every column, since the dataset is quite large

# Helper functions for the plots used in the rest of the workflow

def plot_categorical(series, top_n=10):
    counts = series.value_counts().head(top_n)
    sns.barplot(x=counts.values, y=counts.index)
    plt.title(series.name)
    plt.show();

def plot_numeric(series, bins=20):
    sns.histplot(series.dropna(), bins=bins, kde=False)
    plt.title(series.name)
    plt.show();


In [None]:
# Getting the time when most crashes occur
plot_numeric(sample_data['crash_hour'])


The distribution above shows crashes by hour of day. Most crashes happen around 3 PM, followed by 7 AM and around midnight. The 3 PM peak may reflect the traffic that builds up as people leave work, run errands, and pick children up from school. The 7 AM peak may reflect the early-morning rush as people head to work and school.

In [None]:
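# Map the days of the week, which are stored as numbers with Sunday = 1
# (in the crashes preview, 09/09/2025 is a Tuesday and has crash_day_of_week = 3)
day_map = {
    1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
    5: "Thursday", 6: "Friday", 7: "Saturday"
}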
sample_data['crash_day_of_week'] = sample_data['crash_day_of_week'].map(day_map)

# Ensure categorical order
days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sample_data['crash_day_of_week'] = pd.Categorical(
    sample_data['crash_day_of_week'],
    categories=days_order,
    ordered=True
)

# Plot
sns.countplot(data=sample_data, y='crash_day_of_week', order=days_order)
plt.title("Crashes by Day of Week")
plt.show()
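
Numeric day codes are easy to mislabel, so a quick cross-check against `crash_date` can be worthwhile. The sketch below parses two `CRASH_DATE` strings exactly as they appear in the crashes preview, together with their `CRASH_DAY_OF_WEEK` code, and asks pandas for the day name:

```python
import pandas as pd

# CRASH_DATE strings and CRASH_DAY_OF_WEEK codes as shown in the crashes preview
toy = pd.DataFrame({
    "crash_date": ["09/09/2025 11:55:00 PM", "09/09/2025 09:51:00 PM"],
    "crash_day_of_week": [3, 3],
})

# Parse the 12-hour timestamp format and derive the day name from the date itself
parsed = pd.to_datetime(toy["crash_date"], format="%m/%d/%Y %I:%M:%S %p")
toy["day_name"] = parsed.dt.day_name()
print(toy[["crash_day_of_week", "day_name"]].drop_duplicates())
```

Code 3 lines up with Tuesday here, which is consistent with the codes starting at Sunday = 1.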


In [None]:
# Month map: 1-12 → Month names
month_map = {
    1: "January", 2: "February", 3: "March", 4: "April",
    5: "May", 6: "June", 7: "July", 8: "August",
    9: "September", 10: "October", 11: "November", 12: "December"
}

# Apply mapping
sample_data['crash_month'] = sample_data['crash_month'].map(month_map)

# Chronological order (Jan → Dec)
month_order = list(month_map.values())

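# Plot with order, using sample_data because that is where the month names were mapped
# (merged_df still holds numeric months, so it would not match month_order)
sns.countplot(
    data=sample_data,
    x='crash_month',
    order=month_order,
    palette="crest"
)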
plt.title("Crashes by Month")
plt.xlabel("Month")
plt.ylabel("Number of Crashes")
plt.xticks(rotation=45)
plt.show()


In [None]:
# Considering the severity of the crashes
# The most severe injuries
plot_categorical(sample_data['most_severe_injury'])

In [None]:
plot_categorical(sample_data['damage'])

In [None]:
# Weather conditions under which most accidents occurred

plot_categorical(merged_df['weather_condition'])

In [None]:
# Lighting conditions under which most accidents occurred

plot_categorical(merged_df['lighting_condition'])

In [None]:
# The existence of traffic controls whenever accidents occurred

plot_categorical(merged_df['traffic_control_device'])

In [None]:
# Road surface conditions under which most accidents occurred

plot_categorical(merged_df['roadway_surface_cond'])

In [None]:


# The primary contributory causes recorded for the accidents

plot_categorical(merged_df['prim_contributory_cause'])

In [None]:
# The sex of the people involved in the accidents

plot_categorical(merged_df['sex'])

In [None]:
# Getting an assessment of the ages of people involved in the accidents

plot_numeric(merged_df[merged_df['age'] > 0]['age'], bins=10)

In [None]:
# Instances when the people involved in accidents had worn safety equipment

plot_categorical(merged_df['safety_equipment'])

In [None]:
# Whether or not BAC tests were administered

plot_categorical(merged_df['bac_result'])

In [None]:
# The type of vehicle involved in an accident

plot_categorical(merged_df['vehicle_type'])

In [None]:
# Whether or not the vehicle had any defects

plot_categorical(merged_df['vehicle_defect'])

# Bivariate Analysis

## An assessment of the relationship between features.
### For this section I will focus mainly on the features that align with my problem statement.

In [None]:
# Age vs injury classification 

plt.figure(figsize=(10,6))
sns.boxplot(data=sample_data, x='injury_classification', y='age')
plt.title("Driver Age vs Injury Classification")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# An assessment of the vehicle type vs injury classification
plt.figure(figsize=(12,6))
sns.countplot(data=sample_data, x='vehicle_type', hue='injury_classification',
              order=sample_data['vehicle_type'].value_counts().iloc[:10].index)
plt.xticks(rotation=45)
plt.title("Vehicle Type vs Injury Classification")
plt.show()

In [None]:
# An assessment of the relationship between the BAC test result and injury classification
plt.figure(figsize=(8,5))
sns.countplot(data=sample_data, x='bac_result', hue='injury_classification')
plt.title("ALCOHOL (BAC) Result vs Injury Classification")
plt.xticks(rotation=45)
plt.show()
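
Raw counts in a grouped count plot are dominated by the largest category, which can hide differences in injury rates between groups. A row-normalized cross-tabulation is one way to compare shares instead. The sketch below uses made-up data, and the category labels are illustrative:

```python
import pandas as pd

# Toy data (made up) with one dominant BAC category
toy = pd.DataFrame({
    "bac_result": ["TEST NOT OFFERED"] * 8 + ["TEST TAKEN"] * 2,
    "injury_classification": (
        ["NO INDICATION OF INJURY"] * 7 + ["INCAPACITATING INJURY"] * 1
        + ["NO INDICATION OF INJURY"] * 1 + ["INCAPACITATING INJURY"] * 1
    ),
})

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(
    toy["bac_result"], toy["injury_classification"], normalize="index"
)
print(shares.round(2))
```

The resulting table can be passed to `sns.heatmap` or plotted as a stacked bar chart with `shares.plot(kind="bar", stacked=True)`.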

In [None]:
# crash_hour is already stored as an integer hour (0-23), so it can be grouped directly
simple_grouped = merged_df.groupby('crash_hour').size()

# Create the plot
plt.figure(figsize=(12, 6))
simple_grouped.plot(kind='bar')
plt.title('Traffic Crashes by Hour of Day')
plt.xlabel('Hour (24-hour format)')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

# Multivariate Analysis
### Interactions between multiple variables



In [None]:
# Considering Vehicle Year, Age, and Injury

plt.figure(figsize=(12,6))
sns.scatterplot(data=sample_data, x='vehicle_year', y='age', hue='injury_classification', alpha=0.6)
plt.title("Vehicle Year vs Age colored by Injury Classification")
plt.show();

# A Correlation Heatmap of the Numerical Features

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(sample_data[['age','num_passengers','vehicle_year','occupant_cnt']].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap of Numerical Features")
plt.show();